Research Blog
Review of recent work by the broader Computer Vision community
Deep neural networks first achieved high performance mainly on image classification tasks, where a single class label is predicted for the entire image. Many of the traditional networks, such as AlexNet, VGGNet, and ResNet, were based on "compressing" the image representations and then "decompressing" them, as shown in the following figure.
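This compress-then-decompress pattern can be sketched minimally as an encoder that shrinks spatial resolution with strided convolutions and a decoder that upsamples back. This is a hypothetical toy example, not the architecture of any of the networks named above; all channel sizes are illustrative.

```python
import torch
import torch.nn as nn

# "Compress": strided convolutions successively halve the spatial resolution.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),   # 1/2 resolution
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 1/4 resolution
    nn.ReLU(),
)
# "Decompress": bilinear upsampling restores the original resolution.
decoder = nn.Sequential(
    nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
    nn.Conv2d(32, 3, kernel_size=1),
)

x = torch.randn(1, 3, 64, 64)
z = encoder(x)   # compressed, low-resolution representation
y = decoder(z)   # decompressed back to input resolution
```

The compressed representation `z` has lost fine spatial detail, which is exactly the limitation the rest of this article addresses.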
(All the figures in this article are from the original paper, referenced below.)
While such networks work well for coarse, image-level prediction tasks, many computer vision tasks require position- and pixel-level sensitivity. Examples of such tasks are semantic segmentation, pose estimation, and even object detection.
We discuss a state-of-the-art network for such tasks, HRNet, which, instead of compressing and then decompressing the feature representations, maintains high-resolution representations at multiple scales and achieves better performance.
HRNet is based mainly on the principle that maintaining high-resolution feature representations is critical for vision tasks that require fine position sensitivity.
The network consists of multiple streams, starting from the highest resolution and successively adding lower-resolution streams. The network fuses the multiple streams with a multi-resolution fusion mechanism to achieve a final representation that captures the image at a very fine level. This representation is then used with different heads to perform different vision tasks. An instance of the network with 4 such streams is shown in the following figure.
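The parallel-stream layout can be sketched as follows. This is a hypothetical minimal version, not the paper's implementation: channel counts are illustrative, and each "stream" is reduced to a single convolution, where the real network uses residual blocks.

```python
import torch
import torch.nn as nn

channels = [32, 64, 128]  # one entry per stream; illustrative values

# A stem produces the highest-resolution stream from the input image.
stem = nn.Conv2d(3, channels[0], kernel_size=3, padding=1)

# Lower-resolution streams are created by strided transition convolutions.
transitions = nn.ModuleList(
    nn.Conv2d(channels[i - 1], channels[i], kernel_size=3, stride=2, padding=1)
    for i in range(1, len(channels))
)

# Each stream is then processed at its own resolution, in parallel.
streams = nn.ModuleList(
    nn.Conv2d(c, c, kernel_size=3, padding=1) for c in channels
)

x = torch.randn(1, 3, 64, 64)
feats = [stem(x)]
for down in transitions:
    feats.append(down(feats[-1]))       # add a stream at half the resolution
feats = [conv(f) for conv, f in zip(streams, feats)]
```

The key point is that `feats[0]` stays at full resolution throughout, rather than being recovered by a decoder at the end.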
The fusion modules used throughout the architecture are based on strided convolutions and upsampling, as shown in the following figure.
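A hedged sketch of one fusion step, under the scheme just described: every stream receives the resampled features of every other stream, with strided 3x3 convolutions going down in resolution and bilinear upsampling plus a 1x1 convolution going up. For brevity the convolutions are created inline with random weights; a real module would register them as trainable parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse(feats, channels):
    """Exchange information between all streams (illustrative sketch)."""
    fused = []
    for i, target in enumerate(feats):
        out = target.clone()
        for j, src in enumerate(feats):
            if j == i:
                continue
            if j < i:
                # Higher-resolution source: downsample with repeated
                # strided 3x3 convolutions, one per factor-of-2 step.
                y = src
                for k in range(i - j):
                    in_c = channels[j] if k == 0 else channels[i]
                    y = nn.Conv2d(in_c, channels[i], 3, stride=2, padding=1)(y)
            else:
                # Lower-resolution source: bilinear upsampling followed
                # by a 1x1 convolution to match the channel count.
                y = F.interpolate(src, size=target.shape[-2:],
                                  mode="bilinear", align_corners=False)
                y = nn.Conv2d(channels[j], channels[i], 1)(y)
            out = out + y  # sum the contributions into the target stream
        fused.append(out)
    return fused

channels = [32, 64]
feats = [torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32)]
fused = fuse(feats, channels)
```

After fusion, each stream keeps its own resolution and channel count but now carries information from every other stream.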
The fused streams can be used in multiple ways for the final task, and the authors evaluate three different variants, named (a) HRNetV1, (b) HRNetV2, and (c) HRNetV2p, shown below.
In HRNetV1, the final representation comes only from the highest-resolution stream. Note that this representation still contains information from the other streams due to the intermediate fusions. In HRNetV2, the representations from all streams are concatenated after appropriate upsampling, with the upsampling effected by bilinear interpolation followed by 1x1 convolutions (not shown in the figure for brevity). Finally, in HRNetV2p, a pyramid of features is constructed, where the representations from all the streams are concatenated at each resolution with appropriate downsampling or upsampling.
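The HRNetV2 variant described above can be sketched as a small head: all streams are bilinearly upsampled to the highest resolution, concatenated, and mixed with a 1x1 convolution. The channel counts and the number of output classes here are illustrative assumptions, and the 1x1 convolution is created inline for brevity rather than as a registered module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def hrnetv2_head(feats, num_classes, total_channels):
    """HRNetV2-style head: upsample all streams, concatenate, mix (sketch)."""
    high_res = feats[0].shape[-2:]
    upsampled = [feats[0]] + [
        F.interpolate(f, size=high_res, mode="bilinear", align_corners=False)
        for f in feats[1:]
    ]
    x = torch.cat(upsampled, dim=1)                  # concatenate all streams
    mix = nn.Conv2d(total_channels, num_classes, 1)  # 1x1 conv after upsampling
    return mix(x)

feats = [torch.randn(1, 32, 64, 64),
         torch.randn(1, 64, 32, 32),
         torch.randn(1, 128, 16, 16)]
out = hrnetv2_head(feats, num_classes=19, total_channels=32 + 64 + 128)
```

The output is a per-pixel prediction at the full resolution of the highest stream, which is what dense tasks such as semantic segmentation require.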
The network is then trained on a variety of tasks, e.g. pose estimation, object detection, semantic segmentation, and facial landmark estimation, showing very competitive results and validating the hypothesis that high-resolution representations are critical for such tasks.