Video-and-language understanding has a variety of industrial applications, such as video question answering, text-video retrieval, and multi-label classification. Existing video-and-language understanding methods generally adopt heavy multi-modal encoders and feature fusion modules, which incur high computational costs. In particular, they have difficulty handling the dense video frames and long text that are prevalent in industrial applications. This paper proposes MuLTI, a highly accurate and efficient video-and-language understanding model that achieves efficient and effective feature fusion and adapts rapidly to downstream tasks. Specifically, we design a Text-Guided MultiWay-Sampler, based on adaptive-pooling residual mapping and self-attention modules, to sample long sequences and fuse multi-modal features. This reduces computational costs and addresses the performance degradation caused by previous samplers. As a result, MuLTI can handle longer sequences within a limited computational budget. Then, to further enhance the model's performance and address the lack of pretraining tasks for video question answering, we propose a new pretraining task named Multiple Choice Modeling. This task bridges the gap between pretraining and downstream tasks and improves the model's ability to align video and text features. Benefiting from the efficient feature fusion module and the new pretraining task, MuLTI achieves state-of-the-art performance on multiple datasets. The implementation and pretrained models will be released.
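To make the sampling idea concrete, the following is a minimal sketch of condensing a long feature sequence into a fixed number of tokens by combining adaptive average pooling with an attention step and a residual connection. It is an illustration only, not the MuLTI implementation: the function names (`adaptive_avg_pool`, `attend`, `sample_sequence`) are hypothetical, and the learned projections, text guidance, and multi-way design described in the abstract are all omitted. It assumes features are plain lists of floats.

```python
import math


def adaptive_avg_pool(seq, k):
    """Average-pool n feature vectors into k vectors using adaptive bins."""
    n, d = len(seq), len(seq[0])
    pooled = []
    for i in range(k):
        start = (i * n) // k                # floor(i * n / k)
        end = -((-(i + 1) * n) // k)        # ceil((i + 1) * n / k)
        window = seq[start:end]
        pooled.append([sum(v[j] for v in window) / len(window) for j in range(d)])
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention without learned projections (assumption)."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, key)) / scale for key in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def sample_sequence(seq, k):
    """Condense seq to k tokens: pooled tokens attend over the full sequence,
    and the pooled tokens are added back as a residual."""
    pooled = adaptive_avg_pool(seq, k)
    attended = attend(pooled, seq, seq)
    return [[p + a for p, a in zip(p_row, a_row)]
            for p_row, a_row in zip(pooled, attended)]


# Example: 8 frame features of dimension 2 condensed to 2 tokens.
frames = [[float(i), float(i % 3)] for i in range(8)]
condensed = sample_sequence(frames, 2)
```

In this sketch the attention cost scales with the number of condensed tokens times the input length rather than with the input length squared, which is the general reason a sampler of this kind can reduce computation on long sequences. The residual from the pooled tokens keeps a coarse summary of the input even when the attention weights are uninformative.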