Text-based video segmentation is a challenging task that segments the objects referred to by a natural language expression in videos. It requires both semantic comprehension and fine-grained video understanding. Existing methods introduce language representations into segmentation models in a bottom-up manner, conducting vision-language interaction only within the local receptive fields of ConvNets. We argue that such interaction is insufficient: given only partial observations, the model can barely construct region-level relationships, which runs contrary to the descriptive logic of natural language referring expressions. In fact, people usually describe a target object through its relations with other objects, and these relations may not be easily understood without seeing the whole video. To address this issue, we introduce a novel top-down approach that imitates how humans segment an object under language guidance. We first identify all candidate objects in a video and then choose the referred one by parsing relations among those high-level objects. Three kinds of object-level relations are investigated for precise relationship understanding: positional relations, text-guided semantic relations, and temporal relations. Extensive experiments on A2D Sentences and J-HMDB Sentences show that our method outperforms state-of-the-art methods by a large margin. Qualitative results also show that our results are more explainable.
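The top-down idea described above (enumerate candidate objects first, then pick the referred one from object-level relations) can be illustrated with a toy sketch. This is not the paper's model: the `Candidate` class, the hand-written scoring functions, the `side` argument, and the additive combination of scores are all hypothetical stand-ins for what would be learned modules, and averaging over frames is only a crude proxy for temporal relation modeling.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical candidate object tube (an assumption, not from the paper)."""
    name: str
    boxes: list  # one (x1, y1, x2, y2) box per frame
    feats: list  # one feature vector (list of floats) per frame


def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def center_x(box):
    """Horizontal center of an (x1, y1, x2, y2) box."""
    return (box[0] + box[2]) / 2.0


def semantic_score(cand, text_vec):
    """Text-guided semantic cue: mean per-frame similarity to the text vector.

    Averaging over frames is a simplistic stand-in for temporal modeling.
    """
    return sum(cosine(f, text_vec) for f in cand.feats) / len(cand.feats)


def positional_score(cand, others, side):
    """Positional cue: fraction of (other object, frame) pairs in which
    `cand` lies on the requested side ("left" or "right") of the other object."""
    if side is None or not others:
        return 0.0
    hits, total = 0, 0
    for other in others:
        for box_c, box_o in zip(cand.boxes, other.boxes):
            total += 1
            if side == "left" and center_x(box_c) < center_x(box_o):
                hits += 1
            elif side == "right" and center_x(box_c) > center_x(box_o):
                hits += 1
    return hits / total if total else 0.0


def select_referred(cands, text_vec, side=None):
    """Score every candidate against the others and return the best one."""
    def score(cand):
        others = [c for c in cands if c is not cand]
        return semantic_score(cand, text_vec) + positional_score(cand, others, side)
    return max(cands, key=score)
```

For an expression such as "the dog on the left", one might call `select_referred(cands, dog_vec, side="left")`, where `dog_vec` is an assumed text embedding living in the same space as the object features. The point of the sketch is only that the decision is made over whole objects and their mutual relations, rather than within local receptive fields.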