TensorTour Research Blog

Recent research work in Computer Vision and Machine Learning from our R&D team members

Predicting Depth from Binaural Echoes

Posted by Gaurav on Tue, 25 Jan 2022

Predicting the depth of each element in a scene from an RGB image is an important problem in scene understanding, with applications in robotics and autonomous driving.

Depth sensors such as LiDAR are often used to measure scene depth, but because they are expensive and bulky, researchers have been exploring depth prediction from monocular images (a single image from a standard camera).

In this article, we present our recent research on the topic, published at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021, a flagship computer vision conference. We argue that relatively cheap audio sensors, i.e. an emitter and two receivers capturing echoes, can improve depth prediction from monocular images.

(All the figures in this article are from the original paper, referenced below.)

There has been some prior work on using audio signals for monocular depth prediction. However, previous approaches, shown on the left of the figure above, usually just concatenate the visual and audio features. We instead propose to fuse the two using (i) estimated material properties and (ii) an attention map estimated from a multimodal fusion network (shown on the right of the figure above).

Proposed Approach

Beyond Image to Depth: Improving Depth Prediction using Echoes

Kranti Kumar Parida, Siddharth Srivastava, Gaurav Sharma

IIT Kanpur, CDAC and TensorTour

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

ArXiv open access full version

In general, different elements of a scene are at different depths and are made of different materials. Since sound waves have to travel different distances and are absorbed differently by different materials, the echo signal depends on both the depth and the material properties of the scene elements. In this paper, we propose an attention mechanism between the image and audio modalities, and incorporate material property estimation to account for the differences in absorption, of both light and sound, for the task of depth prediction.
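To make the dependence on depth concrete, here is a minimal sketch of the round-trip relation between echo delay and distance. It assumes a co-located emitter and receiver, a single reflecting surface, and a speed of sound of roughly 343 m/s in air. The function name `echo_delay_to_depth` is hypothetical and is not part of the method in the paper, which learns depth from the full echo rather than from a single delay.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumption: air at roughly 20 degrees Celsius


def echo_delay_to_depth(delay_s: float, speed: float = SPEED_OF_SOUND_M_PER_S) -> float:
    """Distance to a reflector given the round-trip echo delay in seconds."""
    if delay_s < 0:
        raise ValueError("delay must be non-negative")
    # The sound travels to the surface and back, so halve the path length.
    return speed * delay_s / 2.0


if __name__ == "__main__":
    for delay_ms in (5.0, 10.0, 20.0):
        depth = echo_delay_to_depth(delay_ms / 1000.0)
        print(f"delay {delay_ms:.0f} ms -> depth {depth:.3f} m")
```

Real echoes are a superposition of many such reflections, each attenuated by the material it bounced off, which is why a learned model is a better fit than this closed-form relation.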

The figure above shows the block diagram of the proposed method. It has five main modules: (i) the visual net, (ii) the echo net, (iii) the material net, (iv) the multimodal fusion module, and (v) the attention net.

Visual Net

The Visual Net is an encoder-decoder network that predicts depth from a monocular RGB image. It is built from regular and strided convolutions with skip connections between them. We take the visual features from the intermediate layer, i.e. the last conv layer, as one of the inputs to the multimodal fusion block.

Echo Net

The Echo Net is also an encoder-decoder network; it predicts depth from the binaural echo input. The input is a frequency-domain spectrogram representation of the echo, and we take the encoded feature as an input to the multimodal fusion block.
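As an illustration of what a spectrogram input for a two-channel echo can look like, here is a minimal NumPy sketch of a short-time Fourier transform applied to each ear separately. The frame length, hop size, Hann window and the helper names `stft_magnitude` and `binaural_spectrogram` are all assumptions chosen for the example; the exact preprocessing used in the paper may differ.

```python
import numpy as np


def stft_magnitude(signal, frame_len=256, hop=128):
    """Magnitude spectrogram of a 1-D signal, shape (freq_bins, frames)."""
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (signal.size - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # rfft keeps the non-negative frequencies: frame_len // 2 + 1 bins.
    return np.abs(np.fft.rfft(frames, axis=1)).T


def binaural_spectrogram(echo, frame_len=256, hop=128):
    """Stack left/right spectrograms: (2, samples) -> (2, freq_bins, frames)."""
    echo = np.asarray(echo, dtype=np.float64)
    if echo.ndim != 2 or echo.shape[0] != 2:
        raise ValueError("expected shape (2, samples)")
    return np.stack([stft_magnitude(ch, frame_len, hop) for ch in echo])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    echo = rng.standard_normal((2, 2048))  # random stand-in for a recorded echo
    print(binaural_spectrogram(echo).shape)
```

Keeping the two ears as separate channels preserves the differences between them, which is the spatial cue a network can exploit.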

Material Net

The Material Net is a standard conv network (ResNet-18) pretrained on the Materials Dataset, which has classes such as fabric, asphalt, wood and brick. It extracts the material properties of the objects present in the scene, and its features are the last of the three inputs to the multimodal fusion block.

Multimodal Fusion Module

The Multimodal Fusion Module combines the three features (visual, echo and material) and provides a fused representation that serves as the input to the attention prediction network.

Attention Network

The Attention Network takes as input the features from the multimodal fusion module, and outputs an attention map \(\alpha\), which is used to combine the depth maps obtained from the image (\(D_{i}\)) and from the echo (\(D_{e}\)), as

\[D = \alpha \odot D_{e} + (1-\alpha) \odot D_{i}\]

where \(\odot\) denotes elementwise multiplication of matrices.
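The combination rule above is a per-pixel blend, which can be sketched in a few lines of NumPy. The function name `fuse_depth` is hypothetical, and the check that \(\alpha\) lies in [0, 1] is an assumption made for this example (it makes the result a convex combination of the two depth maps); the formula itself is the one given above.

```python
import numpy as np


def fuse_depth(depth_echo, depth_image, alpha):
    """Per-pixel blend D = alpha * D_e + (1 - alpha) * D_i."""
    depth_echo = np.asarray(depth_echo, dtype=np.float64)
    depth_image = np.asarray(depth_image, dtype=np.float64)
    alpha = np.asarray(alpha, dtype=np.float64)
    if not (depth_echo.shape == depth_image.shape == alpha.shape):
        raise ValueError("all three maps must share one shape")
    # Assumption for this sketch: attention weights are in [0, 1].
    if np.any(alpha < 0) or np.any(alpha > 1):
        raise ValueError("alpha values must lie in [0, 1]")
    return alpha * depth_echo + (1.0 - alpha) * depth_image


if __name__ == "__main__":
    d_e = np.array([[2.0, 4.0], [6.0, 8.0]])    # toy echo-based depth
    d_i = np.ones((2, 2))                       # toy image-based depth
    alpha = np.array([[0.0, 1.0], [0.5, 0.25]])
    print(fuse_depth(d_e, d_i, alpha))
```

Where \(\alpha\) is 0 the output equals the image-based depth, where it is 1 it equals the echo-based depth, and values in between mix the two.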


The method is evaluated on two publicly available benchmark datasets, Replica and Matterport3D. We simulate the echoes using a 3D simulator called Habitat, and use the precomputed room impulse responses (RIRs) provided by a previous research effort. The RIRs take into account the material properties and scene geometry.
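For readers unfamiliar with RIRs, the standard way to turn one into an echo is to convolve the emitted signal with the impulse response of each ear. The sketch below shows that step with NumPy. The function name `simulate_binaural_echo`, the two-sample source pulse and the toy RIR arrays are made up for illustration; they are not the Habitat RIRs or the emitted signal used in the paper.

```python
import numpy as np


def simulate_binaural_echo(source, rir_left, rir_right):
    """Convolve one emitted signal with a per-ear RIR; returns (2, samples)."""
    source = np.asarray(source, dtype=np.float64)
    left = np.convolve(source, np.asarray(rir_left, dtype=np.float64))
    right = np.convolve(source, np.asarray(rir_right, dtype=np.float64))
    # Zero-pad the shorter channel so both ears share one length.
    n = max(left.size, right.size)
    out = np.zeros((2, n))
    out[0, : left.size] = left
    out[1, : right.size] = right
    return out


if __name__ == "__main__":
    source = np.array([1.0, 0.5])  # toy emitted pulse
    # Toy RIRs: one delayed, attenuated reflection per ear.
    rir_left = np.array([0.0, 0.0, 0.8])
    rir_right = np.array([0.0, 0.0, 0.0, 0.6])
    print(simulate_binaural_echo(source, rir_left, rir_right))
```

In the toy RIRs, the delay of the nonzero tap stands in for scene geometry and its amplitude for material absorption, the two factors a real RIR encodes.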

On the benchmark datasets, we show quantitatively that incorporating echoes, along with the material properties of the objects in the scene, improves results over current state-of-the-art methods.

For further detailed explanation and experimental results, please see the full technical publication available as a PDF on arXiv.

The figure above shows some qualitative results where the proposed method performs better than using only the RGB image, only the echo, or existing methods. Notice how the proposed method captures details better than the alternatives. Compared to the ground truth depth, there are still some improvements to be made on this task.
