When answering complex questions, people often rely on multiple sources of information, such as visual, textual, and tabular data. Previous approaches to this problem have focused on designing input features or model structures in the multi-modal space, which limits their flexibility for cross-modal reasoning and data-efficient training. In this paper, we call for an alternative paradigm that transforms images and tables into unified language representations, reducing the task to a textual QA problem that can be solved in three steps (retrieval, ranking, and generation), all within a language space. This idea takes advantage of the power of pre-trained language models and is implemented in a framework called Solar. Our experimental results show that Solar outperforms all existing methods by 10.6 to 32.3 points on two datasets, MultimodalQA and MMCoQA, across ten different metrics. Additionally, Solar achieves the best performance on the WebQA leaderboard.
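The retrieval, ranking, and generation steps described in the abstract can be illustrated with a minimal sketch. This is not Solar's implementation: the row template in `linearize_table`, the token-overlap scoring, and the prompt format are all hypothetical stand-ins for the learned components a real system would use, and the image is assumed to have already been converted to a caption by some external model.

```python
import re
from collections import Counter


def linearize_table(title, header, rows):
    """Turn each table row into one sentence (hypothetical template)."""
    sentences = []
    for row in rows:
        cells = ", ".join(f"{h} is {v}" for h, v in zip(header, row))
        sentences.append(f"In table {title}: {cells}.")
    return sentences


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(question, passage):
    """Count tokens shared by question and passage (toy relevance score)."""
    q = Counter(tokenize(question))
    p = Counter(tokenize(passage))
    return sum((q & p).values())


def retrieve(question, passages, k=3):
    """Step 1: keep the k passages with the highest overlap score."""
    return sorted(passages, key=lambda p: overlap_score(question, p), reverse=True)[:k]


def rank(question, candidates):
    """Step 2: reorder candidates; a length-normalised score stands in for a learned ranker."""
    def score(p):
        return overlap_score(question, p) / (len(tokenize(p)) or 1)
    return sorted(candidates, key=score, reverse=True)


def build_generator_input(question, ranked, top_n=2):
    """Step 3: assemble the text a pre-trained language model would read (assumed format)."""
    context = " ".join(ranked[:top_n])
    return f"question: {question} context: {context}"


# All sources live in one language space: text, linearized table rows, image caption.
corpus = ["The city hosted the games in summer."]
corpus += linearize_table(
    "Medals",
    ["athlete", "year", "medal"],
    [["Alice", "2012", "gold"], ["Bob", "2016", "silver"]],
)
corpus.append("A photo of a stadium at night.")  # assumed output of a captioning model

question = "Which year did Alice win gold?"
candidates = retrieve(question, corpus, k=3)
ranked = rank(question, candidates)
prompt = build_generator_input(question, ranked)
```

Because every source is plain text by the time it reaches `retrieve`, the same three functions serve passages, table rows, and image captions alike, which is the simplification the abstract argues for.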