Text-Centric Visual Question Answering (TEC-VQA), in its proper format, not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold standard for evaluating AI models in the domain of text-centric scene understanding. However, most TEC-VQA benchmarks focus on high-resource languages such as English and Chinese. Despite pioneering efforts to expand multilingual QA pairs in non-text-centric VQA datasets using translation engines, the translation-based protocol suffers from a substantial ``visual-textual misalignment'' problem when applied to TEC-VQA: it prioritizes the text in the question-answer pairs while disregarding the visual text present in the images. Moreover, it does not adequately address challenges related to nuanced meaning, contextual distortion, language bias, and question-type diversity. In this work, we address the task of multilingual TEC-VQA and present MTVQA, a benchmark with high-quality human expert annotations in 9 diverse languages. To our knowledge, MTVQA is the first multilingual TEC-VQA benchmark to provide human expert annotations for text-centric scenarios. Furthermore, our evaluation of several state-of-the-art Multimodal Large Language Models (MLLMs), including GPT-4V, on MTVQA shows that there is still considerable room for performance improvement, underscoring the value of our dataset. We hope this dataset will provide researchers in the community with fresh perspectives and inspiration. The MTVQA dataset will be available at https://huggingface.co/datasets/ByteDance/MTVQA.