Research Blog

Review of recent work by the broader Computer Vision community


TK-Loss Review: Mask-Free Visual Instance Segmentation

Posted by Himanshu on Wed, 16 Aug 2023

In recent years, computer vision has made significant advances and has become a central area of artificial intelligence research. These advances, however, come at the cost of data: deep learning models require vast amounts of annotated images and videos to reach high accuracy, since annotations are what teach a model to recognize and classify objects. Large annotated datasets have enabled increasingly complex and accurate computer vision models, but the demand for labeled data keeps growing as the field advances. As a result, efforts are being made both to develop algorithms that can learn from less (or cheaper) supervision and to streamline the annotation process itself. These efforts are crucial for further progress in computer vision and for meeting the growing demand for accurate, efficient AI models across a wide range of applications.

What is Video Instance Segmentation?

Video instance segmentation (VIS) is a challenging computer vision task that involves identifying and segmenting objects in a video into individual instances, even when multiple objects of the same category appear in the same frame. Current state-of-the-art VIS models rely on deep transformer-based architectures, which require large amounts of annotated data to train effectively; video mask annotation, however, is costly and time-consuming, which makes the task even more challenging. In this paper, the authors address this challenge by leveraging rich temporal mask consistency constraints in the video and training with only bounding-box annotations. The method introduces the Temporal KNN-patch Loss (TK-Loss), which enforces consistency of masks over time and can be integrated into existing state-of-the-art VIS methods without modifying the model architecture. Moreover, the TK-Loss has no learnable parameters and adds no training overhead, making it a lightweight yet effective solution for the challenging task of video instance segmentation.

Temporal Mask Consistency Constraint

The Temporal Mask Consistency Constraint states that for any given small region within a frame, the corresponding pixels that belong to the projection of this region should have the same mask prediction in every frame, as they belong to the same underlying physical object or background region.

Method: Temporal KNN-patch Loss

Mask-Free Video Instance Segmentation

Lei Ke, Martin Danelljan, Henghui Ding, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu

ETH Zürich, HKUST

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

CVPR official version

Fig. 1 The Temporal KNN-patch Loss has four steps: 1) Patch Candidate Extraction: search for patch candidates across frames within a radius R. 2) Temporal KNN-Matching: match the K highest-confidence candidates via patch affinities. 3) Consistency Loss: enforce the mask consistency objective among the matches. 4) Cyclic Tube Connection: aggregate the temporal loss over the 5-frame tube.

Patch Candidate Extraction

Patch Candidate Extraction: Let \(X_{p}^{t}\) denote the target image patch of size \(N \times N\) centered at location \(p = (x, y)\) in frame \(t\), and let \(S_{p}^{t\rightarrow\hat{t}}\) denote the set of candidate patch locations in frame \(\hat{t}\) that may represent the same object region. Candidates are restricted to a spatial window of radius \(R\) around \(p\): all locations \(\hat{p}\) with Euclidean distance \(\|p - \hat{p}\| \le R\) are considered.

Temporal KNN-Matching: In this step, we select the top \(K\) matches with the smallest patch distance and discard candidates whose distance exceeds a confidence threshold \(D\), where the patch distance is defined as

\[\begin{equation} d_{p\rightarrow\hat{p}}^{t\rightarrow\hat{t}} = ||X_{p}^{t} - X_{\hat{p}}^{\hat{t}} || \end{equation}\]
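To make the first two steps concrete, here is a minimal NumPy sketch of candidate extraction and KNN-matching for a single location \(p\). The function names and the default values of \(N\), \(R\), \(K\), and the threshold \(D\) are illustrative assumptions, not the authors' implementation.

```python
# A minimal NumPy sketch of Steps 1-2 (patch candidate extraction and
# temporal KNN-matching) for a single pixel location p. Default values of
# N, R, K, D are illustrative assumptions.
import numpy as np

def in_bounds(frame, p, r):
    """True if an N x N patch (N = 2r + 1) centered at p fits inside the frame."""
    y, x = p
    h, w = frame.shape[:2]
    return r <= y < h - r and r <= x < w - r

def extract_patch(frame, p, N):
    """Return the N x N patch of `frame` centered at p = (y, x)."""
    r = N // 2
    y, x = p
    return frame[y - r:y + r + 1, x - r:x + r + 1]

def knn_match(frame_t, frame_t_hat, p, N=3, R=5, K=5, D=0.1):
    """Collect candidate locations p_hat within radius R of p in frame t_hat,
    then keep the K closest ones whose patch distance stays below D."""
    r = N // 2
    if not in_bounds(frame_t, p, r):
        return []
    target = extract_patch(frame_t, p, N)
    candidates = []
    y0, x0 = p
    for dy in range(-R, R + 1):
        for dx in range(-R, R + 1):
            if dy * dy + dx * dx > R * R:          # stay within ||p - p_hat|| <= R
                continue
            p_hat = (y0 + dy, x0 + dx)
            if not in_bounds(frame_t_hat, p_hat, r):
                continue
            dist = np.linalg.norm(target - extract_patch(frame_t_hat, p_hat, N))
            candidates.append((dist, p_hat))
    candidates.sort(key=lambda c: c[0])            # smallest patch distance first
    return [p_hat for dist, p_hat in candidates[:K] if dist < D]
```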

Consistency Loss: Let \(M_p^t \in [0, 1]\) denote the predicted instance mask probability at location \(p\) in the target frame \(t\). The objective for the Temporal KNN-patch Loss between frames \(t\) and \(\hat{t}\) is given as

\[\mathcal{L}_f^{t \rightarrow \hat{t}} = \frac{1}{HW} \sum_{p} \sum_{\hat{p} \in S^{t \rightarrow \hat{t}}_{p}} L_{cons}(M_p^t, M_{\hat{p}}^{\hat{t}})\]

where the consistency loss is calculated as

\[L_{cons}(M_p^t, M_{\hat{p}}^{\hat{t}}) = -\log\big(M_p^t M_{\hat{p}}^{\hat{t}} + (1 - M_p^t)(1 - M_{\hat{p}}^{\hat{t}})\big)\]

The consistency loss reaches zero only when both predictions agree with full confidence, i.e. both indicate foreground or both indicate background.
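The following PyTorch sketch shows how this objective could be evaluated for one frame pair, reusing the matches produced by the sketch above; the tensor layout and the small epsilon inside the log are assumptions of this sketch.

```python
# A minimal PyTorch sketch of the consistency objective. `matches` maps each
# location p to its KNN-matched locations in frame t_hat (e.g. from knn_match
# above). The epsilon guard against log(0) is an assumption.
import torch

def consistency_loss(m_p, m_p_hat, eps=1e-6):
    """L_cons: zero only when both mask probabilities agree, i.e. both are
    confidently foreground (1, 1) or confidently background (0, 0)."""
    return -torch.log(m_p * m_p_hat + (1 - m_p) * (1 - m_p_hat) + eps)

def frame_pair_loss(mask_t, mask_t_hat, matches):
    """L_f^{t -> t_hat}: sum L_cons over every pixel p and its matches,
    normalised by the number of pixels H * W."""
    H, W = mask_t.shape
    total = mask_t.new_zeros(())
    for (y, x), matched in matches.items():
        for (y_hat, x_hat) in matched:
            total = total + consistency_loss(mask_t[y, x], mask_t_hat[y_hat, x_hat])
    return total / (H * W)
```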

Cyclic Tube Connection: Let \(T\) denote the number of frames in the temporal tube. The Temporal KNN Patch Loss for the entire tube is given by

\[\mathcal{L}_{temp} = \sum_{t = 1}^{T} \begin{cases} \mathcal{L}_f^{t \rightarrow (t + 1)}, &\textrm{if } t < T \\ \mathcal{L}_f^{t \rightarrow 1}, &\textrm{if } t = T \end{cases}\]

This loss is calculated cyclically. For instance, if there are four frames in the tube, the loss is calculated between frames 1-2, 2-3, 3-4, and 4-1. The final loss term ensures long-range temporal mask consistency.
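A sketch of the cyclic aggregation, assuming the helpers from the previous sketches; frame indices are 0-based in the code, so the modulo step realises the final \(T \rightarrow 1\) connection.

```python
# A minimal sketch of the cyclic tube aggregation, assuming `knn_match` and
# `frame_pair_loss` from the sketches above.
def build_matches(frame_t, frame_t_hat):
    """Assumed helper: run knn_match for every pixel of frame t."""
    H, W = frame_t.shape[:2]
    return {(y, x): knn_match(frame_t, frame_t_hat, (y, x))
            for y in range(H) for x in range(W)}

def temporal_loss(masks, frames):
    """TK-Loss over a T-frame tube: pairs (1,2), (2,3), ..., (T-1,T), (T,1)."""
    T = len(frames)
    total = 0.0
    for t in range(T):
        t_next = (t + 1) % T                      # wraps the last frame back to the first
        matches = build_matches(frames[t], frames[t_next])
        total = total + frame_pair_loss(masks[t], masks[t_next], matches)
    return total
```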

Joint Spatio-temporal Regularization

Mask-free VIS is trained with joint supervision from spatio-temporal surrogate losses. To ensure spatial consistency, the supervised mask-learning loss is replaced with the Box Projection Loss (\(\mathcal{L}_{proj}\)) and the Pairwise Loss (\(\mathcal{L}_{pair}\)).

\[\begin{equation} \mathcal{L}_{proj} = \sum_{t = 1}^{T} \sum_{d\in \{\vec{x}, \vec{y}\}} D(P_d'(M_{p}^{t}), P_d'(M_{b}^{t})) \end{equation}\]

where \(D\) denotes the Dice Loss, \(P_d'\) is the projection function along the \(\vec{x}\) or \(\vec{y}\) axis, and \(M_{p}^{t}\), \(M_{b}^{t}\) denote the predicted instance mask and the ground-truth box mask, respectively.

\[\begin{equation} \mathcal{L}_{pair} = \frac{1}{T}\sum_{t = 1}^{T} \sum_{p_i' \in H \times W} L_{cons}(M_{p_i'}^{t}, M_{p_j'}^{t}) \end{equation}\]

The Projection Loss ensures consistency between the predicted instance mask and the ground-truth bounding box by projecting the mask onto the x and y axes. Meanwhile, the Pairwise Loss encourages pixels at locations \(p_i'\) and \(p_j'\) with color similarity of at least \(\sigma_{pixel}\) to receive the same mask prediction.

The overall spatial loss is calculated as a weighted combination of the Projection Loss and the Pairwise Loss, with the weight for the Pairwise Loss denoted as \((\lambda_{pair})\).

\[\begin{equation} \mathcal{L}_{spatial} = \mathcal{L}_{proj} + \lambda_{pair}\mathcal{L}_{pair} \end{equation}\]
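The sketch below illustrates both spatial terms, reusing `consistency_loss` from the earlier sketch. The max-projection, the Dice formulation, the single horizontal-neighbour pairing, and the default \(\sigma_{pixel}\) value are simplifying assumptions rather than the paper's exact formulation.

```python
# A minimal PyTorch sketch of the spatial terms; parameter values and the
# neighbourhood choice are assumptions.
import torch

def dice_loss(a, b, eps=1e-6):
    """Dice loss D between two 1-D projections."""
    inter = (a * b).sum()
    return 1 - (2 * inter + eps) / (a.sum() + b.sum() + eps)

def projection_loss(mask, box_mask):
    """L_proj: compare x- and y-axis max-projections of the predicted mask
    and the ground-truth box mask (both H x W tensors with values in [0, 1])."""
    loss = dice_loss(mask.max(dim=0).values, box_mask.max(dim=0).values)         # x-axis projection
    loss = loss + dice_loss(mask.max(dim=1).values, box_mask.max(dim=1).values)  # y-axis projection
    return loss

def pairwise_loss(mask, image, sigma_pixel=0.3):
    """L_pair: apply L_cons to horizontally neighbouring pixels whose colour
    similarity is at least sigma_pixel (image is a 3 x H x W tensor)."""
    colour_sim = torch.exp(-torch.norm(image[:, :, :-1] - image[:, :, 1:], dim=0))
    keep = (colour_sim >= sigma_pixel).float()
    l_cons = consistency_loss(mask[:, :-1], mask[:, 1:])   # from the earlier sketch
    return (keep * l_cons).sum() / keep.sum().clamp(min=1)
```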

The overall joint spatio-temporal loss is given as:

\[\begin{equation} \mathcal{L}_{seg} = \mathcal{L}_{spatial} + \lambda_{temp}\mathcal{L}_{temp} \end{equation}\]

where \(\mathcal{L}_{temp}\) is the Temporal KNN-patch Loss defined above.
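Putting the pieces together, a minimal sketch of the full objective might look as follows; the argument layout (NumPy frames for patch matching, torch tensors for the losses) and the default weights are assumptions made for this sketch.

```python
# A minimal sketch of assembling the joint spatio-temporal objective from the
# sketches above. Default lambda values are placeholders, not the paper's.
def segmentation_loss(masks, box_masks, images, frames_np,
                      lambda_pair=1.0, lambda_temp=1.0):
    """masks, box_masks, images: per-frame torch tensors; frames_np: per-frame NumPy images."""
    l_proj = sum(projection_loss(m, b) for m, b in zip(masks, box_masks))
    l_pair = sum(pairwise_loss(m, img) for m, img in zip(masks, images)) / len(masks)
    l_spatial = l_proj + lambda_pair * l_pair          # L_spatial = L_proj + lambda_pair * L_pair
    l_temp = temporal_loss(masks, frames_np)           # TK-Loss over the tube
    return l_spatial + lambda_temp * l_temp            # L_seg
```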


This study demonstrates that incorporating the joint spatio-temporal loss during training improves the performance of current state-of-the-art VIS models trained with only bounding-box annotations. The proposed method eliminates the need for mask annotations during training and exploits temporal mask consistency constraints to produce strong results, narrowing the gap between fully and weakly supervised VIS.
