Research Blog

Review of recent work by the broader Computer Vision community


Review of a state-of-the-art method for Image Matting

Posted by Prasen on Mon, 15 Nov 2021

Image matting is the task of predicting an alpha matte, a mask image with pixel values ranging from $$0$$ to $$1$$, to precisely extract the foreground region of an image. It is widely used in many image and video editing applications, including film production and virtual backgrounds in video conferencing. This blog discusses the basics as well as one of the recent state-of-the-art methods for a better understanding.

What is image matting?

Image matting refers to precisely separating the foreground region from the background in an image. This relation can be mathematically formulated as,

$I = \alpha \cdot F + (1-\alpha) \cdot B,$

where $$\alpha$$ denotes the alpha matte with each pixel value in $$[0,1]$$, and $$I$$, $$F$$, $$B$$ refer to the input image, its foreground, and background regions, respectively; all being of spatial dimensions $$H \times W$$.
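The compositing equation above can be sketched in a few lines of NumPy; the shapes and random values here are purely illustrative:

```python
import numpy as np

# Minimal sketch of the compositing equation I = alpha * F + (1 - alpha) * B.
# Shapes are illustrative: H x W x 3 color images with an H x W alpha matte.
H, W = 4, 4
rng = np.random.default_rng(0)
F = rng.random((H, W, 3))      # foreground colors
B = rng.random((H, W, 3))      # background colors
alpha = rng.random((H, W))     # alpha matte, each value in [0, 1]

# Broadcast alpha over the color channels and composite.
I = alpha[..., None] * F + (1 - alpha[..., None]) * B
```

Note that a pixel with $$\alpha = 1$$ takes the foreground color exactly, one with $$\alpha = 0$$ takes the background color, and fractional values blend the two, which is what makes soft boundaries like hair possible.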

Given the highly ill-posed nature of the problem and lack of large-scale labeled datasets, existing works leverage various prior-based techniques. A majority of such techniques use trimaps.

What is a trimap?

A trimap divides the image into three regions: foreground, background, and transition (a.k.a. unknown or gray) regions. The matting task is then simplified to estimating the unknown values only in the transition region, which reduces the solution space. The idea is to take rough inputs from users that coarsely mark the foreground, the background, and the boundary between the two.
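A coarse trimap can be derived from a binary segmentation mask with morphological operations; this is a common heuristic (not specific to any one paper), and the helper name and band width below are illustrative:

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def trimap_from_mask(mask: np.ndarray, band: int = 2) -> np.ndarray:
    """Build a coarse trimap from a binary foreground mask (a common
    heuristic): erode for confident foreground, dilate for confident
    background, and mark everything in between as the unknown region.

    Returns an array with 1 = foreground, 0 = background, 0.5 = unknown.
    """
    fg = binary_erosion(mask, iterations=band)    # confident foreground
    bg = ~binary_dilation(mask, iterations=band)  # confident background
    trimap = np.full(mask.shape, 0.5, dtype=np.float32)
    trimap[fg] = 1.0
    trimap[bg] = 0.0
    return trimap

# A 7x7 toy mask with a centered 3x3 foreground square.
mask = np.zeros((7, 7), dtype=bool)
mask[2:5, 2:5] = True
tri = trimap_from_mask(mask, band=1)
```

The width of the unknown band (the `band` parameter) trades off how much work is left to the matting model against how much trust is placed in the rough mask.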

Mask Guided Matting via Progressive Refinement Network

The Johns Hopkins University and Adobe

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

CVPR official open access version

All the figures are from the original paper, referenced above.

Mask Guided (MG) Matting is one of the current state-of-the-art networks for image matting. The authors point out that end-to-end prediction of the alpha matte from a single image requires large-scale datasets; without them, the model lacks semantic guidance and may not generalize well to unseen real-world data. Hence, learning-based methods require additional guidance alongside the input image to precisely estimate its alpha matte. However, committing to one specific form of guidance can restrict the robustness of the model to other priors. The majority of existing works use trimap-based guidance. MG Matting, on the other hand, works in a more general setting, where the guidance can be a trimap, a rough binary segmentation mask, a soft alpha matte, or any easy-to-obtain coarse mask, whether user-defined or model-predicted.

The overall architecture, called the Progressive Refinement Network (PRN), is an encoder-decoder network with a multi-level design. The authors introduce a Progressive Refinement Module (PRM) at each feature level to selectively fuse the matting outputs from the previous level and the current level. Besides predicting the alpha matte, the method also estimates the foreground color, which allows compositing with non-opaque foreground objects where the alpha matte alone is not enough. MG Matting adopts the widely used ResNet34-UNet with Atrous Spatial Pyramid Pooling (ASPP) as the backbone for both the PRN and color estimation.

The PRM, for a level $$l$$, first generates the self-guidance mask $$g_l$$ from the matting output $$\alpha_{l-1}$$ of the previous level as,

$f_{\alpha_{l-1}\rightarrow g_l}(x,y) =\left\{ \begin{array}{rcl} 1 & & \mathrm{if~} {0 < \alpha_{l-1}(x,y) < 1},\\ 0 & & \mathrm{otherwise}. \end{array} \right.$

The self-guidance mask is then used to preserve the confident regions from the previous level's output and lets the current level focus only on refining the uncertain regions, as

$\alpha_l = \alpha^{'}_{l} g_l + \alpha_{l-1} (1 - g_l),$

where $$\alpha^{'}_{l}$$ is the raw output of the current level $$l$$. For foreground estimation, the method trains a separate encoder-decoder model, which takes an image and an alpha matte as input. The paper argues that training a single model to predict both the alpha matte and the foreground color would degrade the matting performance, whereas decoupling the two allows a flexible extension to cases where the alpha matte is already given.
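The two PRM equations above can be sketched directly in NumPy. This is a simplified illustration under stated assumptions: `alpha_prev` stands for the (already upsampled) matte $$\alpha_{l-1}$$ from the previous level and `alpha_raw` for the current level's raw output $$\alpha^{'}_{l}$$; the function name is hypothetical:

```python
import numpy as np

def prm_fuse(alpha_prev: np.ndarray, alpha_raw: np.ndarray) -> np.ndarray:
    """Fuse two matting outputs as in the PRM equations (illustrative sketch)."""
    # Self-guidance mask g_l: 1 where the previous prediction is uncertain
    # (strictly between 0 and 1), 0 where it is already confident.
    g = ((alpha_prev > 0) & (alpha_prev < 1)).astype(alpha_prev.dtype)
    # Keep confident regions from the previous level; refine only the rest.
    return alpha_raw * g + alpha_prev * (1 - g)

# Toy 2x2 example: pixels at exactly 0 or 1 are confident, the rest are not.
alpha_prev = np.array([[0.0, 0.4], [1.0, 0.6]])
alpha_raw  = np.array([[0.9, 0.5], [0.1, 0.7]])
fused = prm_fuse(alpha_prev, alpha_raw)
# Confident pixels (0.0 and 1.0) are preserved; uncertain ones take the
# current level's value, giving [[0.0, 0.5], [1.0, 0.7]].
```

This selective fusion is what makes the refinement "progressive": each level only touches the uncertain band, much like a trimap's transition region, rather than re-predicting the whole matte.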

The main advantage of MG Matting is that it can handle guidance of different qualities (e.g., coarse or rough) and even of different types as input. Thus, it can be considered either a trimap-based or a trimap-free model, depending on what guidance is available. The authors show superior results against state-of-the-art baselines on three benchmark datasets.