Research Blog

Review of recent work by the broader Computer Vision community


Review of a state-of-the-art method for position sensitive vision tasks

Posted by Gaurav on Thu, 23 Dec 2021. Examples of position-sensitive tasks: segmentation, detection, pose estimation.

Deep neural networks first achieved high performance mainly on image classification tasks, where a single class label is predicted for the full image. Many of the traditional networks, like AlexNet, ResNet, VGGNet, etc., were based on “compressing” the image representations and then “decompressing” them, as shown in the following figure.

(All the figures in this article are from the original paper, referenced below.)


While such networks suffice for very coarse, image-level prediction tasks, many computer vision tasks require position- and pixel-level sensitivity. Examples of such tasks are semantic segmentation, pose estimation, and even object detection.
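The compress-then-decompress schedule of such networks can be traced with a toy calculation (the numbers are illustrative, not taken from any specific architecture): the encoder repeatedly halves the spatial resolution, and a decoder for dense tasks mirrors it back up.

```python
# Resolution schedule of a typical "compress then decompress" network
# (illustrative numbers, not from any specific architecture).
size = 224
encoder = []
while size > 7:
    size //= 2              # strided conv / pooling halves the resolution
    encoder.append(size)

# A decoder for dense prediction mirrors the encoder back to full resolution.
decoder = encoder[-2::-1] + [224]

print(encoder)  # [112, 56, 28, 14, 7]
print(decoder)  # [14, 28, 56, 112, 224]
```

The fine spatial detail discarded on the way down to 7x7 is exactly what position-sensitive tasks need, which motivates keeping a high-resolution stream throughout.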

We discuss a state-of-the-art network for such tasks, HRNet, which, instead of compressing and then decompressing the feature representations, maintains high-resolution representations at multiple scales and achieves better performance.

Method: High Resolution Network (HRNet)

Deep High-Resolution Representation Learning for Visual Recognition

Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, Bin Xiao

Microsoft Research, China

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020

ArXiv open access full version

HRNet is mainly based on the principle that maintaining high-resolution feature representations is critical for vision tasks that require fine position sensitivity.

The network consists of multiple streams, starting from the highest resolution and successively adding lower-resolution ones. The network fuses the multiple streams with a multi-resolution fusion mechanism to obtain a final representation that captures the image at a very fine level. This representation is then used with different heads to perform different vision tasks. An instance of the network with 4 such streams is shown in the following figure.


The fusion modules used in the middle of the architecture are based on strided convolutions and upsampling, as shown in the following figure.
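A minimal, shape-level sketch of such a fusion step in NumPy is given below. Here 2x2 average pooling stands in for the paper's strided 3x3 convolutions, and nearest-neighbor repetition stands in for bilinear upsampling; both streams are given the same channel count for simplicity, whereas the actual network uses convolutions to match channel dimensions. All shapes and names here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def downsample(x):
    """Stand-in for a stride-2 conv: 2x2 average pooling halves H and W."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    """Stand-in for bilinear upsampling: nearest-neighbor 2x repetition."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse(high, low):
    """Exchange information between a high- and a low-resolution stream:
    each output stream sums its own features with the rescaled other stream."""
    new_high = high + upsample(low)     # low-res context flows up
    new_low = low + downsample(high)    # fine detail flows down
    return new_high, new_low

high = np.random.rand(32, 64, 64)  # (channels, H, W), highest-resolution stream
low = np.random.rand(32, 32, 32)   # half-resolution stream
new_high, new_low = fuse(high, low)
```

Each stream keeps its own resolution after fusion; only the information is exchanged, which is what lets the highest-resolution stream stay spatially precise while still seeing global context.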


The fused streams can be utilized in multiple ways for the final task, and the authors evaluate three different variants, named (a) HRNetV1, (b) HRNetV2, and (c) HRNetV2p, shown below.


In HRNetV1, the final representation comes only from the highest-resolution stream. Note that this representation still contains information from the other streams due to the intermediate fusions. In HRNetV2, the representations from all streams are concatenated after appropriate upsampling, with the upsampling effected by bilinear upsampling followed by 1x1 convolutions (not shown in the figure for brevity). In the final variant, HRNetV2p, a pyramid of features is constructed, where the representations from all the streams are concatenated at the different resolutions with appropriate downsampling or upsampling.
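The three head variants can be sketched at the shape level as follows. Nearest-neighbor repetition again stands in for the bilinear upsampling + 1x1 convolutions, and average pooling for the downsampling; the channel counts (32 doubling per stream) are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def upsample_to(x, h, w):
    """Nearest-neighbor upsample to (h, w); stand-in for bilinear + 1x1 conv."""
    c, xh, xw = x.shape
    return x.repeat(h // xh, axis=1).repeat(w // xw, axis=2)

def hrnet_v1_head(streams):
    # (a) Keep only the highest-resolution representation.
    return streams[0]

def hrnet_v2_head(streams):
    # (b) Upsample every stream to the highest resolution, concatenate channels.
    c, h, w = streams[0].shape
    return np.concatenate([upsample_to(s, h, w) for s in streams], axis=0)

def hrnet_v2p_head(streams):
    # (c) Downsample the concatenated representation into a feature pyramid.
    pyramid = [hrnet_v2_head(streams)]
    for _ in range(len(streams) - 1):
        c, h, w = pyramid[-1].shape
        pyramid.append(
            pyramid[-1].reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
        )
    return pyramid

# Four streams: channels double while resolution halves (illustrative sizes).
streams = [np.random.rand(32 * 2**i, 64 // 2**i, 64 // 2**i) for i in range(4)]
```

With these sizes, HRNetV1 outputs a (32, 64, 64) map, HRNetV2 a (480, 64, 64) map, and HRNetV2p a four-level pyramid of 480-channel maps, matching the roles the three heads play for segmentation-style versus detection-style tasks.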

The network is then trained for a variety of tasks, e.g. pose estimation, object detection, semantic segmentation, and facial landmark estimation, showing very competitive results and validating the hypothesis that high-resolution representations are critical for such tasks.