GRAtt-VIS: Gated Residual Attention for Auto Rectifying Video Instance Segmentation

Recent work in Video Instance Segmentation (VIS) increasingly relies on online methods to model complex and lengthy video sequences. However, online methods suffer from representation degradation and noise accumulation, especially during occlusion and abrupt changes. Transformer-based query propagation offers a promising direction, at the cost of quadratic memory attention, but it remains susceptible to the degradation of instance features under these conditions and suffers from cascading effects. The detection and rectification of such errors remain largely underexplored. To this end, we introduce \textbf{GRAtt-VIS}, \textbf{G}ated \textbf{R}esidual \textbf{Att}ention for \textbf{V}ideo \textbf{I}nstance \textbf{S}egmentation. First, we use a Gumbel-Softmax-based gate to detect possible errors in the current frame and, based on the gate activation, rectify degraded instance features from their past representations. This residual configuration alleviates the need for dedicated memory and provides a continuous stream of relevant instance features. Second, we propose a novel inter-instance interaction that uses the gate activation as a mask for self-attention. This masking strategy dynamically restricts unrepresentative instance queries in self-attention and preserves vital information for long-term tracking. We refer to this combination of Gated Residual Connection and Masked Self-Attention as the \textbf{GRAtt} block, which can easily be integrated into existing propagation-based frameworks. GRAtt blocks also significantly reduce the attention overhead and simplify dynamic temporal modeling. GRAtt-VIS achieves state-of-the-art performance on YouTube-VIS and the highly challenging OVIS dataset, significantly improving over previous methods. Code is available at \url{https://github.com/Tanveer81/GRAttVIS}.
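
To make the two mechanisms described in the abstract concrete, the following is a minimal, dependency-free sketch of a gate, a gated residual connection, and gate-masked self-attention. It is an illustration under stated assumptions, not the released implementation: the function names (`gumbel_gate`, `gated_residual`, `masked_self_attention`, `gratt_step`), the two-way `[reject, keep]` logit layout, the rule that a rejected query falls back unchanged to its previous-frame value, and the choice to mask rejected queries as keys while letting every query attend to itself are all hypothetical. Learned projections, multiple heads, and the straight-through gradient of Gumbel-Softmax are omitted.

```python
import math
import random


def gumbel_gate(logits, tau=1.0, rng=random):
    """Forward pass of a hard Gumbel-Softmax over [reject, keep] logits.

    Returns 1 if the noisy "keep" logit wins, else 0. (Assumed logit layout.)
    """
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    return 1 if noisy[1] > noisy[0] else 0


def gated_residual(current, previous, gates):
    """Keep the current query where gate == 1, else reuse the previous frame's query."""
    return [cur if g == 1 else prev for cur, prev, g in zip(current, previous, gates)]


def masked_self_attention(queries, gates):
    """Single-head dot-product attention with no learned projections.

    Assumption of this sketch: queries with gate 0 are hidden as keys,
    but each query may always attend to itself so no row is fully masked.
    """
    dim = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        allowed = [j for j, g in enumerate(gates) if g == 1 or j == i]
        scores = [
            sum(a * b for a, b in zip(q, queries[j])) / math.sqrt(dim)
            for j in allowed
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * queries[j][d] for w, j in zip(weights, allowed))
            for d in range(dim)
        ])
    return outputs


def gratt_step(current, previous, gate_logits, rng=random):
    """One illustrative step: gate, rectify from the past, then masked attention."""
    gates = [gumbel_gate(logits, rng=rng) for logits in gate_logits]
    rectified = gated_residual(current, previous, gates)
    return masked_self_attention(rectified, gates), gates


# Usage: the second instance is corrupted in the current frame (e.g. occluded),
# and a hypothetical gate head strongly votes to reject it.
previous = [[1.0, 0.0], [0.0, 1.0]]
current = [[0.9, 0.1], [5.0, 5.0]]
gate_logits = [[-1000.0, 1000.0], [1000.0, -1000.0]]
outputs, gates = gratt_step(current, previous, gate_logits, rng=random.Random(0))
```

In this sketch the rejected instance keeps its previous-frame feature, which is why no separate memory bank is needed, and it is excluded as a key so that it cannot contaminate the queries the gate accepted.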
