MaXTron: Mask Transformer with Trajectory Attention for Video Panoptic Segmentation

Video panoptic segmentation requires consistently segmenting both 'thing' and 'stuff' classes and tracking objects in a video over time. In this work, we present MaXTron, a general framework that exploits Mask XFormer with Trajectory Attention to tackle the task. MaXTron enriches an off-the-shelf mask transformer by leveraging trajectory attention. The deployed mask transformer takes as input a short clip consisting of only a few frames and predicts the clip-level segmentation. To enhance temporal consistency, MaXTron employs within-clip and cross-clip tracking modules that efficiently utilize trajectory attention. Originally designed for video classification, trajectory attention learns to model the temporal correspondences between neighboring frames and aggregates information along the estimated motion paths. However, it is nontrivial to directly extend trajectory attention to per-pixel dense prediction tasks because its cost grows quadratically with input size. To alleviate this issue, we propose to adapt trajectory attention to both the dense pixel features and the object queries, aiming to improve the short-term and long-term tracking results, respectively. In particular, in our within-clip tracking module, we propose axial-trajectory attention, which computes trajectory attention for tracking dense pixels sequentially along the height and width axes. This axial decomposition significantly reduces the computational complexity for dense pixel features. In our cross-clip tracking module, since the object queries in a mask transformer are learned to encode object information, we capture long-term temporal connections by applying trajectory attention to the object queries, which learns to track each object across different clips. Without bells and whistles, MaXTron demonstrates state-of-the-art performance on video segmentation benchmarks.
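
To make the complexity argument concrete, the following is a minimal counting sketch, not the paper's implementation. It compares the number of query-key pairs when every token in a clip attends to every other token against the number for a height-axis pass followed by a width-axis pass. It assumes (this is not stated in the abstract) that the height-axis pass attends over T*H tokens independently per column and the width-axis pass over T*W tokens independently per row. The function names and the clip and feature-map sizes are hypothetical.

```python
def full_attention_pairs(t: int, h: int, w: int) -> int:
    """Query-key pairs when all T*H*W tokens of a clip attend to each other."""
    n = t * h * w
    return n * n


def axial_attention_pairs(t: int, h: int, w: int) -> int:
    """Query-key pairs for a height-axis pass followed by a width-axis pass.

    Assumption (for illustration only): the height-axis pass runs
    independently for each of the W columns over T*H tokens, and the
    width-axis pass runs independently for each of the H rows over T*W tokens.
    """
    height_pass = w * (t * h) ** 2
    width_pass = h * (t * w) ** 2
    return height_pass + width_pass


if __name__ == "__main__":
    # Hypothetical clip length and feature-map size.
    t, h, w = 2, 32, 64
    full = full_attention_pairs(t, h, w)
    axial = axial_attention_pairs(t, h, w)
    print(f"full: {full}, axial: {axial}, reduction: {full / axial:.2f}x")
```

Under these assumptions the reduction factor is H*W / (H + W), which does not depend on the clip length T and grows with the spatial resolution of the feature map.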
