In this paper, we address the challenge of performing open-vocabulary video instance segmentation (OV-VIS) in real time. We analyze the computational bottlenecks of state-of-the-art foundation models that perform OV-VIS and propose a new method, TROY-VIS, that significantly improves processing speed while maintaining high accuracy. We introduce three key techniques: (1) Decoupled Attention Feature Enhancer, which speeds up information interaction between different modalities and scales; (2) Flash Embedding Memory, which provides fast access to text embeddings of object categories; and (3) Kernel Interpolation, which exploits the temporal continuity of videos. Our experiments demonstrate that TROY-VIS achieves the best trade-off between accuracy and speed on two large-scale OV-VIS benchmarks, BURST and LV-VIS, running 20x faster than GLEE-Lite (25 FPS vs. 1.25 FPS) with comparable or even better accuracy. These results demonstrate TROY-VIS's potential for real-time applications in dynamic environments such as mobile robotics and augmented reality. Code and models will be released at https://github.com/google-research/troyvis.
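To make the Flash Embedding Memory idea concrete: since the category vocabulary is fixed per video, text embeddings can be computed once and reused across all frames instead of re-running the text encoder per frame. The sketch below is a minimal, hypothetical illustration of this caching pattern; `encode_text`, `EmbeddingCache`, and all names here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the embedding-cache idea behind Flash Embedding
# Memory: embeddings for a fixed category vocabulary are computed once and
# reused across frames, so the (slow) text encoder never runs per frame.
# All names and signatures are illustrative, not the paper's actual API.
from typing import Callable, Dict, List


class EmbeddingCache:
    def __init__(self, encode_text: Callable[[str], List[float]]):
        self._encode = encode_text
        self._cache: Dict[str, List[float]] = {}

    def get(self, category: str) -> List[float]:
        # Encode on first request, then serve from memory.
        if category not in self._cache:
            self._cache[category] = self._encode(category)
        return self._cache[category]


# Usage: a dummy "encoder" that records calls, showing each category is
# encoded exactly once no matter how many frames query it.
calls = []


def dummy_encoder(text: str) -> List[float]:
    calls.append(text)
    return [float(len(text))]


cache = EmbeddingCache(dummy_encoder)
for _ in range(1000):  # simulate querying the cache for 1000 video frames
    cache.get("person")
    cache.get("dog")
print(len(calls))  # each category encoded once: prints 2
```

The per-frame cost thus drops from a text-encoder forward pass to a dictionary lookup, which is one way a method could amortize open-vocabulary classification over a video.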