Capturing document images with handheld devices is the norm today. These images can easily be shared over email or chat, or stored on a hard drive for future reference.
However, casual capturing of documents often leaves behind folds and distortions that are hard to recover from using normal editing processes.
A number of image processing and deep learning algorithms have been proposed to restore document images to their original, flat structure.
Image processing algorithms, though sufficient on their own for correcting rectilinear distortions in document images, often fall short on document pictures that have folds or curves. Some of the earliest image processing algorithms for document images date back to 2000.
Given the variety of folded documents that might be used as input, it is easy to conclude that pure image processing is a naive approach to document dewarping and cannot model the dewarps for a large share of the documents it sees.
Today, most document dewarping algorithms rely on deep learning methods.
Deep learning, however, is not without its issues. One of the biggest issues in using deep learning to dewarp documents is that there is no obvious ground truth for the dewarp itself.
We might think that scanned document images can serve as ground truth for the predicted dewarps, but because CNNs follow a multiplicative approach (their outputs are weighted combinations of input pixels), we can’t expect the network to shift pixels and produce a dewarp by skewing or stretching the image.
CNN-based methods were first employed by Das et al., where the use of the network was limited to identifying paper creases. This worked in many cases, but failed badly in some.
As can be seen in the image (taken from the supplementary material of Das et al.), the method fails badly in some cases and sometimes introduces irregularities into the documents it dewarps. This strengthens the case for deep learning based methods: image processing or simple optimization breaks down because the search space is too vast for this problem. As we observe, the method of Das et al. breaks down even with four folds in the image.
Arguably one of the first end-to-end models for dewarping document images was DocUNet, proposed by Ma et al. Ma et al. not only proposed DocUNet, but also introduced a benchmark and a method of generating warped document images from ground truth data.
In DocUNet, the generation of warped document images starts with an M×M perturbed mesh. Each perturbed mesh is formed by selecting random points and applying deformations of varying degrees to them. The mesh is then applied to a flatbed scan of a document to generate a warped image. The dewarping grid is found by inverting the mesh, so the inverted mesh can map the generated image back to its original scanned form. The DocUNet paper reports a dataset of 100k images built with this method of warping.
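The mesh perturbation described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the DocUNet generator: the function name `perturbed_mesh`, its parameters, and the distance-based falloff `falloff / (d + falloff)` are hypothetical choices, and the distance is measured to the selected point for simplicity.

```python
import numpy as np

def perturbed_mesh(m=32, n_deforms=3, max_shift=0.1, falloff=0.25, seed=0):
    """Return (regular, warped) meshes of shape (m, m, 2), coordinates in [0, 1].

    Hypothetical sketch: each deformation picks a random vertex and moves
    the whole mesh in a random direction, with nearby vertices moving most.
    """
    rng = np.random.default_rng(seed)
    ys, xs = np.meshgrid(np.linspace(0, 1, m), np.linspace(0, 1, m), indexing="ij")
    regular = np.stack([xs, ys], axis=-1)
    warped = regular.copy()
    for _ in range(n_deforms):
        # Random vertex p and random displacement v for that vertex.
        p = regular[rng.integers(m), rng.integers(m)]
        v = rng.uniform(-max_shift, max_shift, size=2)
        # Weight is 1 at p and decays with distance (assumed falloff form).
        d = np.linalg.norm(regular - p, axis=-1, keepdims=True)
        weight = falloff / (d + falloff)
        warped += weight * v
    return regular, warped

regular, warped = perturbed_mesh()
print(warped.shape)  # (32, 32, 2)
```

A smaller `falloff` keeps each deformation local (closer to a sharp fold), while a larger one spreads it across the page (closer to a gentle curve).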
Thus, DocUNet solves one of the biggest problems deep learning algorithms have faced by providing a ground truth to train the model against: the dewarping grid. The images generated by this method bear limited resemblance to real-world captures, however, as they are distorted on a 2D grid and contain only two-dimensional warps. Better data generation methods have been proposed since then, which help the model generalize better to real-world data and images.
The work by Ma et al. not only gives us a method of generating warped document images, but also proposes a method to dewarp them. The DocUNet network has a stacked U-Net architecture with a concatenation between the two U-Nets in the stack. The complete architecture as taken from the original paper is below:
The network splits and produces a forward mapping y1 from the output of the first U-Net. The same loss applied at y2 is also applied at y1.
Then y1 is concatenated with the output feature map of the first U-Net and serves as the input for the second U-Net. © represents the concatenation operator. y2 can be directly used to generate the rectified image.
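The data flow through the stack can be illustrated with a shape-level sketch. The `unet_stub` function below is a hypothetical stand-in (a random per-pixel linear map, not a real U-Net), and the channel counts are assumptions; the point is only to show where y1 is produced and how it is concatenated with the feature map before the second U-Net.

```python
import numpy as np

def unet_stub(x, out_channels, seed):
    """Hypothetical stand-in for a U-Net: a random per-pixel linear map.

    x has shape (channels, height, width); spatial size is preserved.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((x.shape[0], out_channels))
    return np.einsum("chw,co->ohw", x, w)

def stacked_forward(image, feat_channels=16):
    feats1 = unet_stub(image, feat_channels, seed=0)    # feature map of the first U-Net
    y1 = unet_stub(feats1, 2, seed=1)                   # intermediate mapping: (x, y) per pixel
    stacked = np.concatenate([feats1, y1], axis=0)      # channel-wise concatenation
    feats2 = unet_stub(stacked, feat_channels, seed=2)  # second U-Net sees features and y1
    y2 = unet_stub(feats2, 2, seed=3)                   # final mapping used for rectification
    return y1, y2

y1, y2 = stacked_forward(np.ones((3, 8, 8)))
print(y1.shape, y2.shape)  # (2, 8, 8) (2, 8, 8)
```

Because the same loss is applied to both outputs, y1 acts as an intermediate supervision signal while y2 is the prediction actually used.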
The loss is defined as a combination of an element-wise loss and a shift-invariant loss. To be more specific, the element-wise loss is an L2 loss that helps the model learn the grid accurately, while the shift-invariant loss makes sure that the difference between the values at any two grid elements i and j is close for the ground truth and the generated grid.
Element-wise loss:
$$L_e = \frac{1}{n} \sum_{i} \left( y_i - y_i^* \right)^2$$
Shift-invariant loss:
$$L_s = \frac{1}{2n^2} \sum_{i,j} \left( (y_i - y_j) - (y_i^* - y_j^*) \right)^2$$
Where y denotes values of the predicted grid, y* denotes the ground truth, n is the number of grid elements, and i and j each index a single grid element.
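To make the two terms concrete, here is a minimal NumPy sketch for a single coordinate channel. It assumes the element-wise loss is a mean squared error and the shift-invariant loss averages the squared difference of pairwise offsets over all pairs with a factor of 1/2; the function names are hypothetical. Under that assumption the pairwise sum collapses to the variance of the per-element error, which avoids the O(n²) loop.

```python
import numpy as np

def elementwise_loss(y, y_star):
    """Mean squared error between predicted and ground-truth grid values."""
    d = (y - y_star).ravel()
    return np.mean(d ** 2)

def shift_invariant_loss(y, y_star):
    """Penalise errors in pairwise differences rather than absolute values.

    With d = y - y*, (y_i - y_j) - (y*_i - y*_j) equals d_i - d_j, and
    1/(2 n^2) * sum_ij (d_i - d_j)^2 simplifies to mean(d^2) - mean(d)^2.
    """
    d = (y - y_star).ravel()
    return np.mean(d ** 2) - np.mean(d) ** 2

y_star = np.arange(6.0).reshape(2, 3)
y = y_star + 5.0  # prediction is correct up to a constant shift
print(elementwise_loss(y, y_star))      # 25.0
print(shift_invariant_loss(y, y_star))  # 0.0
```

The example shows why the second term is called shift invariant: a prediction that is off by a constant offset is penalised by the element-wise term but not by the pairwise one.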
The results of DocUNet can be visualized below:
DocUNet, being the first end-to-end deep learning model proposed, sets up a benchmark for future models to improve upon. 130 images of 65 pages are taken at various angles and released as a benchmark dataset.
Metrics for evaluating dewarping methods, such as MS-SSIM (Multi-Scale Structural Similarity Index) and LD (Local Distortion), are used, and the results on these metrics are published.
Following the recent surge of deep learning work in document dewarping, Das et al. suggested a data generation method that is significantly better than that of DocUNet. Whereas the 2D mesh proposed by DocUNet generated 2D warps, the method proposed by Das et al. used 3D meshes to perform 3D warps.
Essentially, the method involves capturing the warps of real documents at various angles to obtain 3D meshes, then augmenting each mesh through rotations, crops, and resizes to get a variety of meshes. These meshes are then applied to scanned images to generate warped images and their corresponding dewarping ground truth.
Along with the backward mapping that helps to dewarp the document, the method also provides additional information about the 2D document image in the form of 3D depth maps, normal maps, and so on. The generated dataset is called Doc3D and can be downloaded here.
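How a backward mapping dewarps an image can be shown with a short NumPy sketch. It assumes the mapping stores, for every pixel of the flat output, the normalised (x, y) location in the warped input to sample from; that coordinate convention and the function name `unwarp` are assumptions for illustration, and nearest-neighbour lookup is used here in place of the interpolation a real pipeline would typically apply.

```python
import numpy as np

def unwarp(warped, backward_map):
    """Sample a warped image with a backward mapping (nearest neighbour).

    warped:       (H, W) or (H, W, C) image.
    backward_map: (h, w, 2) array of (x, y) in [0, 1]; entry [r, c] says
                  where output pixel (r, c) comes from in the warped image.
    """
    H, W = warped.shape[:2]
    xs = np.clip(np.rint(backward_map[..., 0] * (W - 1)), 0, W - 1).astype(int)
    ys = np.clip(np.rint(backward_map[..., 1] * (H - 1)), 0, H - 1).astype(int)
    return warped[ys, xs]

image = np.arange(12).reshape(3, 4)
ys, xs = np.meshgrid(np.linspace(0, 1, 3), np.linspace(0, 1, 4), indexing="ij")
identity = np.stack([xs, ys], axis=-1)     # maps every pixel to itself
mirrored = np.stack([1 - xs, ys], axis=-1)  # samples from the mirrored column
print(np.array_equal(unwarp(image, identity), image))           # True
print(np.array_equal(unwarp(image, mirrored), image[:, ::-1]))  # True
```

Because the output is built by looking up source locations, every output pixel receives a value, which is the usual reason for predicting a backward rather than a forward mapping.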
Along with a method of data generation, Das et al. proposed a neural network architecture to dewarp document images, using backward mapping and the 3D coordinate map as ground truths for training.
The architecture, similar to DocUNet, is a stacked encoder-decoder architecture that processes the image in three stages. In the first stage, the first encoder-decoder of the stack, a U-Net termed the shape network, regresses the 3D shape of the document from the document image.
The next stage consists of the second network of the stack, an encoder-decoder network with DenseNet blocks, which transforms the 3D coordinates into texture coordinates.
The final stage has a refinement network that serves as a post-processing module, fixing lighting and other issues in the dewarped document image.
The refinement module consists of two U-Net style encoder-decoder blocks, one of which predicts surface normals, while the other predicts shading maps. A comprehensive figure detailing the architecture of the network is provided below:
Here C and B are the ground truths for the 3D coordinates and the backward mapping, respectively.
The training of the network takes place in two phases. In the first phase, the shape network and the texture mapping network are trained individually. The shape network is trained with the accumulated loss expressed as:
$$L_C = \lVert \hat{C} - C \rVert_1 + \lambda \lVert \nabla \hat{C} - \nabla C \rVert_1$$
Where Ĉ is the predicted 3D coordinate map, λ weights the gradient term, and ∇C=||(∇ₓC,∇ᵧC)||₂ combines the horizontal and vertical image gradients of the 3D coordinate map C. Image gradients help the model learn specifically about ridges and valleys in the coordinate map C.
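A minimal NumPy sketch of such a shape loss is given below. It assumes an L1 penalty on both the coordinate term and the gradient term, uses finite differences from `np.gradient` as the image gradient, and picks a hypothetical weight `lam=0.2`; none of these choices are confirmed by the text above, so treat it as an illustration only.

```python
import numpy as np

def gradient_magnitude(c):
    """Per-channel gradient magnitude of a coordinate map c of shape (H, W, 3)."""
    gy, gx = np.gradient(c, axis=(0, 1))  # vertical, then horizontal differences
    return np.sqrt(gx ** 2 + gy ** 2)

def shape_loss(c_pred, c_true, lam=0.2):
    """Coordinate error plus a weighted gradient error (assumed L1 on both)."""
    coord_term = np.mean(np.abs(c_pred - c_true))
    grad_term = np.mean(np.abs(gradient_magnitude(c_pred) - gradient_magnitude(c_true)))
    return coord_term + lam * grad_term

rng = np.random.default_rng(0)
c_true = rng.random((16, 16, 3))
print(shape_loss(c_true, c_true))        # 0.0
print(shape_loss(c_true + 1.0, c_true))  # close to 1.0: offset costs only the coordinate term
```

The second call illustrates the role of the gradient term: a constant offset leaves the gradients unchanged, so only errors that alter ridges and valleys are penalised by it.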
The texture network is trained to minimize the difference between the backward mapping and the generated backward grid. The mathematical form of the loss is expressed below:
$$L_B = \lVert \hat{B} - B \rVert_1 + \gamma \lVert \hat{D} - D \rVert_2$$
The first term is the L1 loss between the predicted backward mapping B̂ and the ground-truth mapping B. The second term, the reconstruction loss, is the L2 loss between the document unwarped with the predicted mapping (D̂) and the document unwarped with the ground-truth mapping (D), weighted by γ.
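In the second phase of training, the shape and texture networks are trained simultaneously, with their losses linearly weighted and added up. The loss for the network is then expressed as:
$$L = \alpha L_C + \beta L_B$$
where L_C and L_B are the shape and texture losses, and α and β are their weights.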
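The texture loss and the weighted combination used in the second phase can be sketched as follows. The function names and the weights `gamma`, `alpha`, and `beta` are hypothetical, and the L2 reconstruction term is implemented as a mean squared error, which is one common reading of "L2 loss".

```python
import numpy as np

def texture_loss(b_pred, b_true, d_pred, d_true, gamma=0.5):
    """L1 error on the backward mapping plus weighted reconstruction error.

    b_*: backward mappings; d_*: documents unwarped with the predicted and
    ground-truth mappings. gamma is an assumed weight.
    """
    mapping_term = np.mean(np.abs(b_pred - b_true))
    recon_term = np.mean((d_pred - d_true) ** 2)
    return mapping_term + gamma * recon_term

def joint_loss(loss_shape, loss_texture, alpha=0.5, beta=0.5):
    """Linear combination of shape and texture losses (assumed weights)."""
    return alpha * loss_shape + beta * loss_texture

b_true = np.zeros((4, 4, 2))
d_true = np.zeros((4, 4))
l_b = texture_loss(b_true + 0.1, b_true, d_true + 2.0, d_true)
print(l_b)                   # 0.1 + 0.5 * 4.0 = 2.1
print(joint_loss(1.0, l_b))  # 0.5 * 1.0 + 0.5 * 2.1 = 1.55
```

Training the two networks separately first and only then with the joint loss means the texture network initially sees clean ground-truth coordinates rather than the shape network's early, noisy predictions.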
Results and Comparisons
A comparison with DocUNet is provided in the DewarpNet paper. The comparison covers not only MS-SSIM and LD, but also OCR performance, using metrics such as Edit Distance and CER (Character Error Rate). The tables for the same are provided below:
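For orientation, the two OCR metrics can be computed as in the following minimal sketch. It assumes the standard definitions: Edit Distance is the Levenshtein distance between the recognised text and the reference text, and CER is that distance divided by the number of reference characters. The function names are hypothetical, and the exact evaluation protocol of the paper is not reproduced here.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if characters match
            ))
        prev = curr
    return prev[-1]

def cer(recognized, reference):
    """Character error rate relative to the reference length."""
    return edit_distance(recognized, reference) / len(reference)

print(edit_distance("kitten", "sitting"))  # 3
print(cer("dewrap", "dewarp"))             # 2 edits over 6 characters
```

Lower values are better for both metrics, so a dewarping method that straightens text lines well should reduce them when the same OCR engine is run on its output.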
- S. Das, G. Mishra, A. Sudharshana, and R. Shilkrot, “The common fold: Utilizing the four-fold to dewarp printed documents from a single image,” in Proceedings of the 2017 ACM Symposium on Document Engineering, 2017, pp. 125–128.
- K. Ma, Z. Shu, X. Bai, J. Wang, and D. Samaras, “DocUNet: Document image unwarping via a stacked U-Net,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- S. Das, K. Ma, Z. Shu, D. Samaras, and R. Shilkrot, “DewarpNet: Single-image document unwarping with stacked 3D and 2D regression networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 131–140.