Interactive Video Object Segmentation (iVOS) is a challenging task that requires real-time human-computer interaction. To improve the user experience, it is important to consider the user's input habits, segmentation quality, running time, and memory consumption. However, existing methods compromise user experience with a single input mode and slow running speed. Specifically, these methods only allow the user to interact with a single frame, which limits the expression of the user's intent. To overcome these limitations and better align with people's usage habits, we propose a framework that accepts multiple frames simultaneously and explores synergistic interaction across frames (SIAF). Concretely, we design the Across-Frame Interaction (AFI) module, which enables users to annotate different objects freely on multiple frames. The AFI module migrates scribble information among multiple interactive frames and generates multi-frame masks. Additionally, we employ an id-queried mechanism to process multiple objects in batches. Furthermore, for more efficient propagation and a more lightweight model, we design a truncated re-propagation strategy to replace the previous multi-round fusion module; this strategy employs an across-round memory that stores important interaction information. Our SwinB-SIAF achieves new state-of-the-art performance on DAVIS 2017 (89.6%, J&F@60). Moreover, our R50-SIAF is more than 3× faster than the state-of-the-art competitor under challenging multi-object scenarios.
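The abstract does not spell out how the truncated re-propagation decides where to stop. One plausible reading, offered purely as an assumption and not as the authors' implementation, is that each newly annotated frame is propagated outward in both directions until it reaches another annotated frame, whether from the current round or from the across-round memory. The function name `truncated_spans` and its parameters are hypothetical and chosen only for illustration.

```python
def truncated_spans(num_frames, new_frames, memory_frames):
    """Return {frame: (start, end)} inclusive spans to re-propagate.

    Assumption (not from the paper): propagation from a newly annotated
    frame stops just before the nearest other annotated frame on each
    side, or at the video boundary if there is none.
    """
    # Every annotated frame, old or new, acts as a stopping point.
    anchors = sorted(set(memory_frames) | set(new_frames))
    spans = {}
    for frame in sorted(set(new_frames)):
        # Nearest annotated frame to the left, or -1 if none exists.
        left = max((a for a in anchors if a < frame), default=-1)
        # Nearest annotated frame to the right, or num_frames if none exists.
        right = min((a for a in anchors if a > frame), default=num_frames)
        spans[frame] = (left + 1, right - 1)
    return spans


if __name__ == "__main__":
    # Frames 2 and 8 were annotated in earlier rounds; frame 5 is new.
    print(truncated_spans(10, new_frames=[5], memory_frames=[2, 8]))
```

Under this assumption, only the frames between the neighbouring annotated frames are revisited in a new round, which is one way the cost of a full-video re-propagation could be avoided. The actual stopping criterion used by SIAF may differ.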