Local-Global Context Aware Transformer for Language-Guided Video Segmentation

We explore the task of language-guided video segmentation (LVS). Previous algorithms mostly adopt 3D CNNs to learn video representations, so they struggle to capture long-term context and easily suffer from visual-linguistic misalignment. In light of this, we present Locater (local-global context aware Transformer), which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner. The memory has two components: one persistently preserves global video content, and the other dynamically gathers local temporal context and segmentation history. Based on the memorized local-global context and the particular content of each frame, Locater holistically and flexibly comprehends the expression as an adaptive query vector for each frame. The vector is used to query the corresponding frame for mask generation. The memory also allows Locater to process videos with linear time complexity and constant-size memory, whereas Transformer-style self-attention computation scales quadratically with sequence length. To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset but poses increased challenges in disambiguating among similar objects. Experiments on three LVS datasets and our A2D-S+ show that Locater outperforms previous state-of-the-art methods. Further, we won 1st place in the Referring Video Object Segmentation Track of the 3rd Large-scale Video Object Segmentation Challenge, where Locater served as the foundation for the winning solution. Our code and dataset are available at: https://github.com/leonnnop/Locater
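The complexity claim can be illustrated with a toy sketch. It is not the Locater implementation. It assumes each frame is reduced to a single feature vector, uses plain dot-product attention without learned projections, and omits segmentation history. The names `attend` and `frame_queries` and the sizes `global_size` and `local_size` are hypothetical. The point is only that a bounded memory keeps per-frame work constant, so total work grows linearly with the number of frames.

```python
import math
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Single-head dot-product attention; the keys double as values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [sum(w * key[i] for w, key in zip(weights, keys))
            for i in range(len(query))]


def frame_queries(frames, expression, global_size=4, local_size=2):
    """Return one adaptive query vector per frame, plus the comparison count.

    Assumption: the global memory is a fixed number of evenly spaced frames,
    and the local memory is a sliding window over the most recent frames.
    """
    step = max(1, len(frames) // global_size)
    global_mem = frames[::step][:global_size]   # built once, never grows
    local_mem = deque(maxlen=local_size)        # oldest entries are evicted
    queries, comparisons = [], 0
    for frame in frames:
        # Context size is at most global_size + local_size + 1 per frame.
        context = global_mem + list(local_mem) + [frame]
        comparisons += len(context)
        queries.append(attend(expression, context))
        local_mem.append(frame)
    return queries, comparisons


if __name__ == "__main__":
    video = [[float(t), 1.0] for t in range(100)]   # 100 toy frame vectors
    text = [0.5, -0.5]                              # toy expression embedding
    per_frame, cost = frame_queries(video, text)
    print(len(per_frame), cost, len(video) ** 2)
```

With `T` frames, the comparison count is bounded by `T * (global_size + local_size + 1)`, whereas full self-attention over all frames would need `T * T` comparisons.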

