In this paper, we work towards extending Audio-Visual Question Answering (AVQA) to multilingual settings. Existing AVQA research has focused predominantly on English, and replicating it to address AVQA in other languages would require a substantial allocation of resources. As a scalable solution, we leverage machine translation and present two multilingual AVQA datasets covering eight languages, created from existing benchmark AVQA datasets. This avoids the additional human annotation effort of collecting questions and answers manually. To this end, we propose MERA, a framework that leverages state-of-the-art (SOTA) video, audio, and textual foundation models for AVQA in multiple languages. We introduce a suite of models, namely MERA-L, MERA-C, and MERA-T, with varied architectures to benchmark the proposed datasets. We believe our work will open new research directions and serve as a reference benchmark for future work in multilingual AVQA.
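To make the dataset-creation idea concrete, the sketch below shows one way the question/answer text of an existing English AVQA benchmark could be machine-translated into another language while leaving the audio and video references untouched. This is an illustrative sketch only, not the authors' released pipeline: the MT model (NLLB-200 via the Hugging Face `transformers` translation pipeline), the target language, and the field names (`video_id`, `question`, `answer`) are assumptions for demonstration.

```python
# Illustrative sketch: translating the textual fields of an AVQA sample with an
# off-the-shelf MT model. Model choice and field names are assumptions.
from transformers import pipeline

# NLLB-200 uses FLORES-style language codes, e.g. "hin_Deva" for Hindi.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="hin_Deva",
)

def translate_qa(sample: dict) -> dict:
    """Translate question and answer text; audio/video references are kept as-is."""
    out = dict(sample)  # keep video_id, clip paths, timestamps, etc. unchanged
    out["question"] = translator(sample["question"], max_length=128)[0]["translation_text"]
    out["answer"] = translator(sample["answer"], max_length=64)[0]["translation_text"]
    return out

if __name__ == "__main__":
    sample = {"video_id": "demo_0001", "question": "Which instrument is playing?", "answer": "violin"}
    print(translate_qa(sample))
```

Repeating this over each target language yields parallel multilingual QA annotations without any additional manual collection, which is the scalability argument made above.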