Video Referring Expression Comprehension (REC) aims to localize a target object in a video given a natural language query. Recent progress in video REC has come from Transformer-based methods with learnable queries. However, we contend that this naive query design is ill-suited to the open-world nature of video REC that text supervision brings: with numerous potential semantic categories, a few slowly updated queries are insufficient to characterize them. Our solution is to create dynamic queries conditioned on both the input video and the language, so as to model the diverse objects referred to. Specifically, we place a fixed number of learnable bounding boxes throughout the frame and use the corresponding region features to provide prior information. We also observe that current query features overlook the importance of cross-modal alignment. To address this, we align specific phrases in the sentence with semantically relevant visual areas, annotating these correspondences in existing video datasets (VID-Sentence and VidSTG). By incorporating these two designs, our proposed model, ConFormer, outperforms other models on widely used benchmark datasets. For example, on the test split of the VID-Sentence dataset, ConFormer achieves an 8.75% absolute improvement on [email protected] over the previous state-of-the-art model.
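To make the idea of content-conditioned queries concrete, the following is a minimal, dependency-free sketch and not the ConFormer implementation. It assumes a frame is given as a small grid of feature vectors, that boxes are normalized `(x1, y1, x2, y2)` corners, that region features are obtained by average pooling, and that video and language are fused by simple addition. All function names are hypothetical, and in the actual model the boxes would be learnable parameters rather than fixed values.

```python
import math
from typing import List, Sequence, Tuple

# Normalized corner format (x1, y1, x2, y2) in [0, 1]; an assumption of this sketch.
Box = Tuple[float, float, float, float]
FeatureMap = Sequence[Sequence[Sequence[float]]]  # rows x cols x channels


def pool_region(feature_map: FeatureMap, box: Box) -> List[float]:
    """Average the feature vectors of all grid cells covered by `box`."""
    h, w = len(feature_map), len(feature_map[0])
    x1, y1, x2, y2 = box
    # Map normalized coordinates to cell indices, keeping at least one cell.
    c0 = min(int(x1 * w), w - 1)
    r0 = min(int(y1 * h), h - 1)
    c1 = max(c0 + 1, min(math.ceil(x2 * w), w))
    r1 = max(r0 + 1, min(math.ceil(y2 * h), h))
    cells = [feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    dim = len(cells[0])
    return [sum(cell[d] for cell in cells) / len(cells) for d in range(dim)]


def content_conditioned_queries(
    feature_map: FeatureMap,
    boxes: Sequence[Box],
    text_embedding: Sequence[float],
) -> List[List[float]]:
    """Build one query per box from its region feature and the sentence embedding.

    Additive fusion is an illustrative assumption, not the paper's stated design.
    """
    queries = []
    for box in boxes:
        region = pool_region(feature_map, box)
        queries.append([r + t for r, t in zip(region, text_embedding)])
    return queries


# Toy 2x2 frame with 2-channel features and a 2-dimensional sentence embedding.
frame = [[[1.0, 0.0], [3.0, 0.0]],
         [[5.0, 2.0], [7.0, 2.0]]]
boxes = [(0.0, 0.0, 0.5, 0.5), (0.0, 0.0, 1.0, 1.0)]
queries = content_conditioned_queries(frame, boxes, [0.5, 0.5])
```

Because each query is computed from the current frame and sentence, it changes with the input, in contrast to a fixed set of learned query vectors shared across all inputs.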