Research Blog

Review of recent work by the broader Computer Vision community


Better Image Inpainting: Structure and Texture Go Hand in Hand

Posted by Arjun on Fri, 18 Aug 2023

Traditionally, image inpainting was done with diffusion-based or patch-based methods. These techniques reuse textures and colors from the same image, producing acceptable textures but mostly failing to generate definite structures. GAN-based approaches, on the other hand, produce good semantic structures but lack textural detail. LaMa used fast Fourier convolutions to increase the receptive field for generating repeating patterns; it produces remarkable repeating textures, but its structures fade out as the holes get larger. In this article we'll discuss a state-of-the-art inpainting method that combines fast Fourier convolutions and a GAN to synthesize good structures and textures with a single network.

What is Image Inpainting?

Deep image inpainting is a computer vision technique used to fill in missing or corrupted parts of an image with plausible content. The inpainting process involves feeding an incomplete image into a trained model, which generates a completed version of the image by filling in the missing regions. The model uses the surrounding context and learned patterns to infer the content of the missing regions and generate visually plausible results. Deep image inpainting has found applications in various domains, including digital photography, image editing software, film restoration, and even forensics.


Keys to Better Image Inpainting: Structure and Texture Go Hand in Hand

Jitesh Jain, Yuqian Zhou, Ning Yu, Humphrey Shi

SHI Lab, University of Oregon, IIT Roorkee, Picsart AI Research (PAIR), Adobe Inc., Salesforce Research

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023

WACV official version


Fig. 1: (a) Proposed Inpainting Framework. (b) The architecture of the FaF Synthesis (FaF-Syn) module inside the generator. (c) The architecture of FaF-Res Block.

The RGB image \(I_{hole}\) is concatenated with the hole mask \(M\), where \(I_{hole} = I_{org} \circ M\) and \(\circ\) denotes element-wise multiplication. The resulting four-channel input is fed to the encoder to produce a latent vector \(z_{enc}\).
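The masking and concatenation step can be sketched in a few lines. This is a minimal NumPy stand-in with a toy 8×8 image; the shapes and the square hole are illustrative, not the paper's configuration.

```python
import numpy as np

# Toy 8x8 RGB image and a binary hole mask (1 = known pixel, 0 = hole).
# The names I_org, M, I_hole follow the article's notation.
rng = np.random.default_rng(0)
I_org = rng.random((3, 8, 8))            # RGB image, channels-first
M = np.ones((1, 8, 8))
M[:, 2:6, 2:6] = 0.0                     # square hole in the middle

I_hole = I_org * M                       # element-wise (Hadamard) product
x = np.concatenate([I_hole, M], axis=0)  # four-channel encoder input
print(x.shape)                           # (4, 8, 8)
```

The fourth channel carries the mask itself, so the encoder can tell known pixels from holes even where the image happens to be black.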

The generator is inspired by StyleGAN2 and contains the FaF Synthesis module. Similar to CoModGAN, a random noise latent vector is generated and passed through a mapping network to obtain \(z_w\). \(z_w\) is concatenated with \(z_{enc}\) and fed to the generator, which produces the inpainted image \(I_{comp}\).

Fourier Coarse-to-Fine (FcF) Generator

The generator integrates the ideas of LaMa's fast Fourier convolutional residual blocks and the co-modulated, StyleGAN2-based coarse-to-fine generator. It uses the FaF Synthesis (FaF-Syn) module; each FaF-Syn contains a FaF-Res block, and each FaF-Res block contains two Fast Fourier Convolutional (FFC) layers.

Fast Fourier Convolutional Residual Blocks (FaF-Res)

Fig. 2: FFC Layer Framework and Visualization of Inverse FFT2d Features.

Each FaF-Res block contains two FFC layers (Fig. 1 (c)). The FFC layer is based on a channel-wise fast Fourier transform (FFT). It splits the channels into two halves: local and global. The local branch captures spatial details using convolutions, while the global branch considers global structure and captures long-range context. The spectral transform uses two Fourier Units (FU). A Fourier Unit breaks the spatial structure down into image frequencies using a real FFT2d operation, applies a convolution in the frequency domain, and finally recovers the structure with an inverse FFT2d operation. The spectral transform is responsible for capturing global and semi-global information: the left Fourier Unit (FU) models the global context, while the Local Fourier Unit (LFU) on the right takes in a quarter of the channels and focuses on semi-global information in the image. Fourier Units are able to produce good textures because, after the inverse FFT2d operation, the output does not correspond to complicated generated images; rather, it generates global repeating textures (Fig. 2).

Fast Fourier Synthesis (FaF-Syn) Module

FaF-Syn (Fig. 1 (b)) takes in both the encoded skip-connection features \(X_{skip}\) and the features upsampled from the previous level in the generator. It obtains existing image textures from the encoder features and generated textural features from the previous generator level, and integrates them to produce global repeating textural features, progressively refining the coarse-level repetitive textures at the finer levels.
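The fusion of encoder and generator features can be sketched roughly as below. This is an assumption about the wiring, not the paper's exact module: the previous-level features are upsampled, concatenated with the skip features, and mixed with a 1×1 convolution (the real FaF-Syn additionally runs a FaF-Res block on the result).

```python
import numpy as np

def faf_syn_fuse(x_skip, x_prev, w):
    """Hypothetical FaF-Syn fusion sketch: upsample previous-level
    generator features, concatenate with encoder skip features, mix."""
    x_up = x_prev.repeat(2, axis=1).repeat(2, axis=2)  # nearest-neighbor 2x upsample
    z = np.concatenate([x_skip, x_up], axis=0)         # encoder + generator features
    return np.einsum('oc,chw->ohw', w, z)              # 1x1 conv over channels

rng = np.random.default_rng(5)
x_skip = rng.random((4, 8, 8))   # skip features at the current resolution
x_prev = rng.random((4, 4, 4))   # coarser features from the previous level
w = rng.random((4, 8)) * 0.1
y = faf_syn_fuse(x_skip, x_prev, w)
print(y.shape)                   # (4, 8, 8)
```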

Other Modules

Encoder Network: The encoder is similar to the discriminator used in StyleGAN2 but without residual skip connections. It takes the hole mask \(M\) and \(I_{hole}\) as input and encodes them down to a spatial size of \(4\times4\). Skip connections link the encoder and the decoder. The encoded feature maps are passed through a linear layer to obtain the latent vector \(z_{enc}\).

Mapping Network: The mapping network is an 8-layer MLP, similar to the one used in StyleGAN2. It transforms the noise latent vector \(z \sim \mathcal{N}(0, I)\) into a latent space vector \(z_w\). Affine transformations are applied to the stack of \(z_w\) and \(z_{enc}\) to obtain the style coefficient \(s\), which is used to scale the weights of the convolutional layers inside the generator.
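The mapping network itself is just a stack of fully connected layers. Here is a minimal NumPy sketch; the 16-dimensional latent and the plain ReLU are illustrative choices (StyleGAN2 uses a 512-dimensional latent and leaky ReLU).

```python
import numpy as np

def mapping_network(z, weights):
    """Sketch of an 8-layer MLP mapping network: repeated
    linear layers with a ReLU nonlinearity (dimensions are toy)."""
    h = z
    for W in weights:
        h = np.maximum(W @ h, 0.0)   # linear + ReLU
    return h

rng = np.random.default_rng(2)
z = rng.standard_normal(16)                              # z ~ N(0, I)
weights = [rng.standard_normal((16, 16)) * 0.1 for _ in range(8)]
z_w = mapping_network(z, weights)
print(z_w.shape)                                         # (16,)
```

The point of this network is to warp the isotropic Gaussian prior into a latent space \(z_w\) whose coordinates are easier for the generator's style modulation to use.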

Loss Functions

Adversarial Loss: A non-saturating logistic loss with R1 regularization is used as the adversarial loss. The gradient penalty \(L_{reg}\) contributes to the final discriminator loss.

\[L_{reg} = E_{I_{org},M}\left[ || \nabla D_{\theta} (\text{stack} (M, I_{org})) ||^2 \right]\]
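The penalty is just the expected squared norm of the discriminator's input gradient on real samples. To illustrate the formula without autograd, the sketch below uses a toy linear discriminator \(D(x) = w \cdot x\), whose input gradient is simply \(w\); a real implementation computes the gradient with automatic differentiation.

```python
import numpy as np

def r1_penalty(w, x):
    """R1 penalty L_reg = E[||grad_x D(x)||^2] for a toy linear
    discriminator D(x) = w . x, so dD/dx = w for every sample."""
    grad = np.broadcast_to(w, x.shape)       # gradient w.r.t. each input
    return np.mean(np.sum(grad ** 2, axis=1))

rng = np.random.default_rng(3)
x = rng.random((5, 4))                       # batch of 5 "real" inputs
w = np.array([1.0, 0.0, 2.0, 0.0])
print(r1_penalty(w, x))                      # ||w||^2 = 5.0
```

Penalizing this norm keeps the discriminator's decision surface smooth around real data, which stabilizes GAN training.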

High Receptive Field Perceptual Loss (HRFPL): Computed as the \(L_2\) distance between high-level features of \(I_{org}\) and \(I_{comp}\). To obtain the high-level features, a dilated ResNet-50 pretrained on ADE20K semantic segmentation is used. This loss supervises the structures during training.

\[L_{HRFPL} = \Sigma^{P-1}_{p=0} \frac{|| \Psi^{I_{comp}}_p - \Psi^{I_{org}}_p ||_2}{N}\]

where \(\Psi^{I_*}_p\) is the feature map of the \(p^{th}\) layer for input \(I_*\), and \(N\) is the number of feature points in \(\Psi^{I_{org}}_p\).
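Given the two feature stacks, the loss reduces to a per-layer normalized distance, summed over layers. The sketch below follows the formula above literally; the random arrays stand in for the dilated ResNet-50 features, which this demo does not compute.

```python
import numpy as np

def hrf_perceptual_loss(feats_comp, feats_org):
    """HRFPL sketch: sum over layers p of ||Psi_comp_p - Psi_org_p||_2 / N,
    where N is the number of feature points in the layer."""
    loss = 0.0
    for f_c, f_o in zip(feats_comp, feats_org):
        N = f_o.size                                      # feature points in layer p
        loss += np.linalg.norm((f_c - f_o).ravel()) / N   # L2 distance, normalized
    return loss

rng = np.random.default_rng(4)
feats_org = [rng.random((8, 16, 16)) for _ in range(3)]   # stand-in features of I_org
feats_comp = [f + 0.1 for f in feats_org]                 # slightly perturbed I_comp
print(hrf_perceptual_loss(feats_comp, feats_org) > 0.0)   # True
```

Because the features come from a segmentation backbone with a large receptive field, matching them pushes the generator toward globally consistent structure rather than pixel-exact copies.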

Reconstruction Loss: A pixel-wise \(L_1\) reconstruction loss is computed between \(I_{comp}\) and \(I_{org}\).

\[L_{rec} = ||I_{comp} - I_{org}||_{1}\]


Total Loss:

\[L_{total} = L_{adv} + \lambda_{rec}L_{rec} + \lambda_{HRFPL}L_{HRFPL}\]

Where \(\lambda_{rec} = 10\), \(\lambda_{HRFPL} = 5\), and \(\lambda_{reg} = 5\).
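Combining the terms is simple weighted arithmetic. The loss values below are illustrative numbers, not results from the paper; only the weights come from the article.

```python
# Total training objective with the paper's weights.
lambda_rec, lambda_hrfpl = 10.0, 5.0
L_adv, L_rec, L_hrfpl = 0.7, 0.02, 0.05      # illustrative per-batch loss values
L_total = L_adv + lambda_rec * L_rec + lambda_hrfpl * L_hrfpl
print(L_total)                                # 0.7 + 0.2 + 0.25 = 1.15
```

The relatively large \(\lambda_{rec}\) keeps the generator anchored to the known pixels, while the adversarial and perceptual terms drive plausible content inside the hole.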

Qualitative Results

Fig. 3: Qualitative results and comparison with other state-of-the-art methods trained on the Places2 dataset.

As the results show, FcF produces better structures as well as better textures. The remaining models either create a fading-out effect or fail to adapt well to the structures.

Fig. 4: Qualitative results and Comparison with other state-of-the-art methods trained on CelebA-HQ dataset.

LaMa produces faded-out hair on the forehead, and CoModGAN completes the image from its own prior without understanding the image well, whereas FcF produces fine images with consistent, appropriate eyes and eyebrows. All in all, it produces much better results than the other models.

Quantitative Results

Table 1: Quantitative evaluation on the Places2 and CelebA-HQ datasets. Bold text indicates the best performance; red and blue fonts indicate second and third place.

Table 2: Quantitative comparisons using \(512 \times 512\) images on Places2 for segmentation masks.
