Research Blog
Review of recent work by the broader Computer Vision community
In recent years, computer vision has made significant advances and become a central area of research in artificial intelligence. These advances, however, come at the cost of data: modern deep learning models require vast amounts of annotated images and videos to reach high accuracy, since this data is what teaches them to recognize and classify objects. Large datasets have enabled increasingly complex and accurate models, but the demand for annotations keeps growing as the field progresses. As a result, researchers are developing algorithms that can perform well with less data and improving annotation pipelines to increase the supply of labeled datasets. Both efforts are crucial for further progress in computer vision and for meeting the demand for accurate, efficient AI models across a variety of applications.
Video instance segmentation (VIS) is a challenging computer vision task that involves identifying, segmenting, and tracking individual object instances across a video, even when multiple objects of the same category appear in the same frame. Current state-of-the-art VIS models are deep transformer-based architectures that require large amounts of annotated data to train effectively, and video mask annotation is particularly costly and time-consuming. In the reviewed paper, the authors address this challenge by exploiting rich temporal mask consistency constraints in the video and training with only bounding-box annotations. The method introduces the Temporal KNN-patch Loss (TK-Loss), which enforces consistency of predicted masks over time and can be integrated into existing state-of-the-art VIS methods without modifying the model architecture. The TK-Loss has no learnable parameters, making it a lightweight yet effective solution for this challenging task.
The Temporal Mask Consistency Constraint states that for any given small region within a frame, the corresponding pixels that belong to the projection of this region should have the same mask prediction in every frame, as they belong to the same underlying physical object or background region.
Patch Candidate Extraction: Let \(X_{p}^{t}\) denote the target image patch of size \(N \times N\) centered at location \(p = (x, y)\) in frame \(t\). The goal is to build \(S_{p}^{t\rightarrow\hat{t}}\), the set of corresponding patches in frame \(\hat{t}\) that cover the same object region. As candidates, all patch locations \(\hat{p}\) in frame \(\hat{t}\) within a radius \(R\) of \(p\), i.e. with Euclidean distance \(\|p - \hat{p}\| < R\), are considered.
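As a concrete illustration, here is a minimal NumPy sketch of the candidate search; the function name and the simple loop-based window scan are our own, not the paper's implementation:

```python
import numpy as np

def candidate_locations(p, radius, height, width):
    """Enumerate candidate locations p_hat within Euclidean distance `radius`
    of the source location p = (x, y), clipped to the image bounds."""
    x, y = p
    candidates = []
    # Scan the square window that bounds the search circle.
    for cx in range(max(0, x - radius), min(width, x + radius + 1)):
        for cy in range(max(0, y - radius), min(height, y + radius + 1)):
            if np.hypot(cx - x, cy - y) < radius:
                candidates.append((cx, cy))
    return candidates
```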
Temporal KNN-Matching: From the candidates, the \(K\) matches with the smallest patch distance \(d\) are selected, and any match whose distance exceeds a maximum threshold \(D\) is discarded, so that only confident correspondences are kept.
\[\begin{equation} d_{p\rightarrow\hat{p}}^{t\rightarrow\hat{t}} = ||X_{p}^{t} - X_{\hat{p}}^{\hat{t}} || \end{equation}\]Consistency Loss: Let \(M_p^t \in [0, 1]\) denote the predicted foreground probability of the instance mask at location \(p\) in the target frame \(t\). The objective function for the Temporal KNN-patch Loss is given as
\[\mathcal{L}_f^{t \rightarrow \hat{t}} = \frac{1}{HW} \sum_{p} \sum_{\hat{p} \in S^{t \rightarrow \hat{t}}_{p}} L_{cons}(M_p^t, M_{\hat{p}}^{\hat{t}})\]where the consistency loss is calculated as
\[L_{cons}(M_p^t, M_{\hat{p}}^{\hat{t}}) = -\log\big(M_p^t M_{\hat{p}}^{\hat{t}} + (1 - M_p^t)(1 - M_{\hat{p}}^{\hat{t}})\big)\]The consistency loss becomes 0 only when both predictions agree with full confidence, i.e. both indicate foreground (1) or both indicate background (0).
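To make the three steps concrete, below is a simplified PyTorch-style sketch of the TK-Loss contribution of a single source pixel; all function and argument names are ours, boundary handling is omitted, and the actual implementation is vectorized over the whole frame:

```python
import torch

def tk_loss_single_pixel(frame_t, frame_that, mask_t, mask_that,
                         p, candidates, patch_size=3, K=5, D=0.1, eps=1e-6):
    """TK-Loss contribution of one source pixel p = (x, y) in frame t.

    frame_t, frame_that: (C, H, W) image tensors of the two frames.
    mask_t, mask_that:   (H, W) predicted foreground probabilities.
    candidates:          candidate locations in frame t_hat within radius R of p.
    """
    r = patch_size // 2

    def patch(img, loc):
        # N x N patch centered at loc (assumes loc is far enough from the border).
        x, y = loc
        return img[:, y - r:y + r + 1, x - r:x + r + 1]

    src = patch(frame_t, p)
    # Patch distance d_{p -> p_hat} = || X_p^t - X_p_hat^t_hat ||
    dists = torch.stack([torch.norm(src - patch(frame_that, c)) for c in candidates])

    # Temporal KNN-matching: keep the K nearest candidates that fall below the threshold D.
    k = min(K, len(candidates))
    topk = torch.topk(dists, k, largest=False)
    matches = [candidates[i] for i, d in zip(topk.indices.tolist(), topk.values.tolist()) if d < D]

    # Consistency loss summed over the surviving matches.
    m_p = mask_t[p[1], p[0]]
    loss = torch.zeros(())
    for (x, y) in matches:
        m_q = mask_that[y, x]
        loss = loss - torch.log(m_p * m_q + (1 - m_p) * (1 - m_q) + eps)
    return loss
```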
Cyclic Tube Connection: Let \(T\) denote the number of frames in the temporal tube. The Temporal KNN Patch Loss for the entire tube is given by
\[\mathcal{L}_{temp} = \sum_{t = 1}^{T} \begin{cases} \mathcal{L}_f^{t \rightarrow (t + 1)}, &\textrm{if } t < T \\ \mathcal{L}_f^{t \rightarrow 1} , &\textrm{if } t = T \end{cases}\]The loss is computed cyclically. For instance, if there are four frames in the tube, it is computed between frames 1-2, 2-3, 3-4, and 4-1. The final wrap-around term enforces long-range temporal mask consistency.
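A short sketch of the cyclic pairing over a tube of \(T\) frames; `frame_pair_loss` stands in for the per-pair loss \(\mathcal{L}_f^{t \rightarrow \hat{t}}\) and is a hypothetical callable, not an API from the paper:

```python
def temporal_loss(frames, masks, frame_pair_loss):
    """Cyclic tube connection: pair each frame t with t+1, and the last frame with the first.

    frames, masks:   lists of length T (images and predicted masks).
    frame_pair_loss: callable returning L_f^{t -> t_hat} for one ordered frame pair.
    """
    T = len(frames)
    total = 0.0
    for t in range(T):
        t_hat = (t + 1) % T  # pairs 0-1, 1-2, ..., (T-2)-(T-1), (T-1)-0 (cyclic wrap-around)
        total = total + frame_pair_loss(frames[t], frames[t_hat], masks[t], masks[t_hat])
    return total
```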
Mask-free VIS is trained with joint supervision from spatio-temporal surrogate losses. To ensure spatial consistency, the authors replace the supervised mask learning loss with the Box Projection Loss \(\mathcal{L}_{proj}\) and the Pairwise Loss \(\mathcal{L}_{pair}\).
\[\begin{equation} \mathcal{L}_{proj} = \sum_{t = 1}^{T} \sum_{d\in \{\vec{x}, \vec{y}\}} D(P_d'(M_{p}^{t}), P_d'(M_{b}^{t})) \end{equation}\]where \(D\) denotes the Dice Loss, \(P_d'\) is the projection function along the \(\vec{x}\)/\(\vec{y}\) axis direction, and \(M_{p}^{t}\), \(M_{b}^{t}\) denote the predicted instance mask and the ground-truth box mask, respectively.
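A minimal sketch of this loss under the common BoxInst-style convention, where the projection is a max over each axis and the ground-truth box is rasterized as a binary mask; the helper names are ours:

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Dice loss between two 1-D soft masks."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def projection_loss(pred_mask, box_mask):
    """L_proj: compare x- and y-axis projections of the predicted mask
    with those of the ground-truth box mask.

    pred_mask, box_mask: (H, W) tensors; box_mask is 1 inside the box, 0 outside.
    """
    loss_x = dice_loss(pred_mask.max(dim=0).values, box_mask.max(dim=0).values)  # projection onto the x axis
    loss_y = dice_loss(pred_mask.max(dim=1).values, box_mask.max(dim=1).values)  # projection onto the y axis
    return loss_x + loss_y
```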
\[\begin{equation} \mathcal{L}_{pair} = \frac{1}{T}\sum_{t = 1}^{T} \sum_{p_i' \in H \times W} L_{cons}(M_{p_i'}^{t}, M_{p_j'}^{t}) \end{equation}\]The Projection Loss enforces consistency between the predicted instance mask and the ground-truth bounding box by comparing their projections onto the x and y axes. The Pairwise Loss enforces that neighboring pixels \(p_i'\) and \(p_j'\) whose color similarity is at least \(\sigma_{pixel}\) receive the same mask prediction.
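Below is a simplified sketch of the pairwise term restricted to a 4-neighborhood, where only adjacent pixel pairs whose color similarity reaches \(\sigma_{pixel}\) contribute; the function signature and the precomputed similarity maps are our own assumptions:

```python
import torch

def pairwise_loss(pred_mask, color_sim, sigma_pixel=0.3, eps=1e-6):
    """L_pair over horizontally and vertically adjacent pixel pairs.

    pred_mask: (H, W) predicted foreground probabilities.
    color_sim: dict with precomputed color-similarity maps for horizontal and
               vertical neighbor pairs, shapes (H, W-1) and (H-1, W).
    """
    def cons(m_a, m_b):
        # Same agreement term as the temporal consistency loss L_cons.
        return -torch.log(m_a * m_b + (1 - m_a) * (1 - m_b) + eps)

    pairs = {
        "h": (pred_mask[:, :-1], pred_mask[:, 1:]),   # left-right neighbors
        "v": (pred_mask[:-1, :], pred_mask[1:, :]),   # top-bottom neighbors
    }
    loss, count = torch.zeros(()), 0
    for key, (a, b) in pairs.items():
        keep = color_sim[key] >= sigma_pixel          # only color-similar edges contribute
        if keep.any():
            loss = loss + cons(a[keep], b[keep]).sum()
            count += int(keep.sum())
    return loss / max(count, 1)
```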
The overall spatial loss is calculated as a weighted combination of the Projection Loss and the Pairwise Loss, with the weight for the Pairwise Loss denoted as \((\lambda_{pair})\).
\[\begin{equation} \mathcal{L}_{spatial} = \mathcal{L}_{proj} + \lambda_{pair}\mathcal{L}_{pair} \end{equation}\]And the overall joint spatio-temporal Loss is given as:
\[\begin{equation} \mathcal{L}_{seg} = \mathcal{L}_{spatial} + \lambda_{temp}\mathcal{L}_{temp} \end{equation}\]where \(\mathcal{L}_{temp}\) is the Temporal KNN-patch Loss defined above.
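As a final illustration, the joint objective reduces to a weighted sum of the individual terms (a sketch with hypothetical default weights):

```python
def segmentation_loss(l_proj, l_pair, l_temp, lambda_pair=1.0, lambda_temp=1.0):
    """L_seg = L_spatial + lambda_temp * L_temp, with L_spatial = L_proj + lambda_pair * L_pair."""
    l_spatial = l_proj + lambda_pair * l_pair
    return l_spatial + lambda_temp * l_temp
```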
This study demonstrates that adding the joint spatio-temporal loss during training improves the current state-of-the-art VIS models while using only bounding-box annotations. The proposed method eliminates the need for mask annotation during training and leverages temporal mask consistency constraints to produce strong results, narrowing the gap between fully and weakly supervised VIS.