Research Blog

Review of recent work by the broader Computer Vision community


Review of a state-of-the-art method for Background Removal

Posted by Gaurav on Fri, 12 Nov 2021

[Figure: Background removal examples]

Background removal and salient object segmentation are important and challenging problems, both from a research and a practical application point of view. In this blog we review some state-of-the-art methods for these tasks which have been published in top Computer Vision and Machine Learning (CVML) venues (conferences and journals). The research area has been active for quite a few years and is closely related to many other tasks which are independently studied as well.

In this article we discuss U-Square Net (U2-Net), one of the current state-of-the-art networks in salient object segmentation.

We do note that some authors refer to the problem as salient object detection (while still predicting pixel-level masks and not just bounding boxes). Since traditional object detection methods predict bounding boxes, and pixel-wise prediction tasks are more commonly called segmentation, we stick to the salient object segmentation nomenclature here to be unambiguous about the pixel-level prediction.

What is background removal?

Background removal, or the complementary task of salient object segmentation, refers to dividing the image pixels into primarily two classes: foreground and background. As shown in the top image, the objects can be anything in the general version of the problem, and the task can become ambiguous in the presence of multiple objects, when different people might disagree on which object is the important one vs. the rest. In the research community, an unambiguous version of the problem is studied, where the images or videos in the benchmark datasets contain only a single prominent object.
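Once a network predicts a per-pixel saliency map, turning it into an actual background removal is a simple masking step. The sketch below illustrates this with NumPy; `remove_background`, the threshold value, and the toy image are our own illustrative choices, not part of any paper.

```python
import numpy as np

def remove_background(image, saliency, threshold=0.5):
    """Zero out background pixels using a predicted saliency map.

    image:    H x W x 3 uint8 array
    saliency: H x W float array in [0, 1] (e.g. a network's output)
    """
    mask = (saliency >= threshold).astype(np.float32)  # hard binary mask
    # Broadcast the mask over the colour channels; background goes to 0.
    return (image.astype(np.float32) * mask[..., None]).astype(np.uint8)

# Toy example: a flat grey 4x4 image whose left half is "salient".
img = np.full((4, 4, 3), 200, dtype=np.uint8)
sal = np.zeros((4, 4), dtype=np.float32)
sal[:, :2] = 1.0
out = remove_background(img, sal)
```

In practice one often keeps the soft saliency values as an alpha matte instead of hard-thresholding, which gives smoother edges around the foreground object.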

Method: U-Square Net (U2-Net)

U2-Net: Going Deeper with Nested U-Structure for Salient Object Detection

Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R. Zaiane and Martin Jagersand

University of Alberta, Canada

Pattern Recognition, 2020

ArXiv open access full version

All the figures below are from the original paper, referenced above.

The authors of U2-Net point out that most existing methods for salient object segmentation reuse deep learning backbones such as AlexNet, VGG and ResNet. Since these backbones were designed for image classification, they may not be appropriate for a pixel-level task. In addition, the architectures of such backbones are complex, and they require heavy computation to pre-train on large datasets. In the U2-Net paper, the authors therefore propose a novel architecture which retains sufficient information for pixel-level prediction, can be trained from scratch, and runs near real time on a modern GPU.


Towards a new architecture, the authors propose a ReSidual U-block (RSU) which extracts multi-scale features from the input image/feature map. It consists of successive downsampling stages built from conv + BN + ReLU units, followed by a symmetric U-structure that replaces the downsampling with upsampling, with a residual connection around the whole block. In the figure above, the paper shows the plain (PLN), residual (RES), dense (DSE) and inception (INC) units alongside the proposed RSU.
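The RSU idea can be sketched in PyTorch as a small encoder-decoder whose output is added back to its input. This is a simplified sketch for illustration only, not the authors' implementation: the class names, the `depth` parameter, and the choice of max-pooling/bilinear upsampling are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBNReLU(nn.Module):
    """The basic conv + batch norm + ReLU unit used throughout the RSU."""
    def __init__(self, in_ch, out_ch, dilation=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))

class RSU(nn.Module):
    """Simplified ReSidual U-block: a small U-Net whose output is added
    to the input feature map (the residual connection)."""
    def __init__(self, in_ch, mid_ch, out_ch, depth=3):
        super().__init__()
        self.conv_in = ConvBNReLU(in_ch, out_ch)
        self.encoders = nn.ModuleList(
            [ConvBNReLU(out_ch, mid_ch)]
            + [ConvBNReLU(mid_ch, mid_ch) for _ in range(depth - 1)])
        self.bottom = ConvBNReLU(mid_ch, mid_ch, dilation=2)
        self.decoders = nn.ModuleList(
            [ConvBNReLU(mid_ch * 2, mid_ch) for _ in range(depth - 1)]
            + [ConvBNReLU(mid_ch * 2, out_ch)])

    def forward(self, x):
        xin = self.conv_in(x)
        feats, h = [], xin
        for i, enc in enumerate(self.encoders):
            h = enc(h)
            feats.append(h)                      # keep for the skip connection
            if i < len(self.encoders) - 1:
                h = F.max_pool2d(h, 2)           # downsample between stages
        h = self.bottom(h)
        for i, dec in enumerate(self.decoders):
            skip = feats[-(i + 1)]
            if h.shape[-2:] != skip.shape[-2:]:  # upsample back towards input size
                h = F.interpolate(h, size=skip.shape[-2:], mode='bilinear',
                                  align_corners=False)
            h = dec(torch.cat([h, skip], dim=1))
        return h + xin                           # residual connection

x = torch.randn(1, 3, 16, 16)
y = RSU(in_ch=3, mid_ch=8, out_ch=16)(x)
```

Because the block is a U-shape internally, each RSU sees its input at several resolutions, which is what gives it the multi-scale receptive field the authors aim for.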

U2-Net architecture

Using the RSU as a building block, the authors then propose a nested U-Net architecture, in which each block is an RSU. The saliency map output from each stage is upsampled, concatenated with the others, and fused to make the final saliency map prediction.
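The fusion step described above can be sketched as follows. This is our own stand-alone illustration, not the paper's code: each decoder stage is assumed to emit a 1-channel logit map, and the maps are upsampled to the input resolution, concatenated, and mixed by a 1x1 convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencyFusion(nn.Module):
    """Fuse per-stage side-output logit maps into one final saliency map."""
    def __init__(self, num_stages=6):
        super().__init__()
        # 1x1 conv learns how to weight the concatenated side outputs.
        self.fuse = nn.Conv2d(num_stages, 1, kernel_size=1)

    def forward(self, side_logits, out_size):
        ups = [F.interpolate(s, size=out_size, mode='bilinear',
                             align_corners=False) for s in side_logits]
        fused = self.fuse(torch.cat(ups, dim=1))
        # Sigmoid turns each logit map into a per-pixel probability map.
        return torch.sigmoid(fused), [torch.sigmoid(u) for u in ups]

# Six side outputs at progressively coarser resolutions, as in a 6-stage decoder.
sides = [torch.randn(1, 1, 256 // 2 ** i, 256 // 2 ** i) for i in range(6)]
fused, side_probs = SaliencyFusion(num_stages=6)(sides, (256, 256))
```

Keeping the individual upsampled side outputs around (not just the fused map) is useful for supervising each stage during training.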

The authors state that the main advantage of the network is its capability to extract multi-scale features within each stage/level of the architecture, as well as to aggregate multi-level features across stages for better prediction. They show very competitive results on six salient object detection benchmark datasets.
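The paper trains the network with deep supervision: a binary cross-entropy loss is applied to the fused saliency map and to every side-output map. A minimal sketch of such a loss, assuming all maps have already been passed through a sigmoid and all terms are weighted equally (the function name is ours):

```python
import torch
import torch.nn.functional as F

def deep_supervision_loss(fused, side_maps, target):
    """BCE on the fused map plus BCE on each side-output map.

    fused:     N x 1 x H x W probability map (after sigmoid)
    side_maps: list of N x 1 x H x W probability maps, one per stage
    target:    N x 1 x H x W binary ground-truth mask
    """
    loss = F.binary_cross_entropy(fused, target)
    for m in side_maps:
        loss = loss + F.binary_cross_entropy(m, target)
    return loss

# Toy example with six side outputs and a random binary target.
fused = torch.sigmoid(torch.randn(1, 1, 8, 8))
side_maps = [torch.sigmoid(torch.randn(1, 1, 8, 8)) for _ in range(6)]
target = (torch.rand(1, 1, 8, 8) > 0.5).float()
loss = deep_supervision_loss(fused, side_maps, target)
```

Supervising every stage pushes each level of the nested U-structure to produce a plausible saliency map on its own, rather than relying only on the final fusion.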