Jaeger: A Concatenation-Based Multi-Transformer VQA Model

Source: arXiv. This is the technical research paper for the CIKM 2023 DocIU challenge. The authors received the CIKM 2023 DocIU Winner Award, sponsored by Google, Microsoft, and the Centre for Data-Driven Geoscience.

Document-based Visual Question Answering (VQA) is a challenging task that sits between linguistic sense disambiguation and fine-grained multimodal retrieval. Although document-based question answering has made encouraging progress thanks to large language models and open-world prior models\cite{1}, several challenges persist, including prolonged response times, extended inference durations, and imprecise matching. To overcome these challenges, we propose Jaeger, a concatenation-based multi-transformer VQA model. To derive question features, we use RoBERTa large\cite{2} and GPT2-xl\cite{3} as feature extractors. We then concatenate the outputs of the two models. This operation lets the model consider information from diverse sources concurrently, strengthening its representational capability. Because both extractors are pre-trained, concatenating their features has the potential to amplify their performance beyond what either achieves alone. After concatenation, we apply dimensionality reduction to the output features, reducing the model's computational cost and inference time. Empirical results demonstrate that our proposed model achieves competitive performance on Task C of the PDF-VQA Dataset.
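The concatenate-then-reduce step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The feature sizes 1024 and 1600 are the published hidden sizes of RoBERTa large and GPT2-xl. Everything else is an assumption made for illustration: random vectors stand in for the two extractors' outputs, each question is assumed to be pooled into a single vector per model, the reduction is assumed to be a linear projection (the abstract does not name the method), and `REDUCED_DIM` and all function names are hypothetical.

```python
import random

ROBERTA_LARGE_DIM = 1024  # hidden size of RoBERTa large
GPT2_XL_DIM = 1600        # hidden size of GPT2-xl
REDUCED_DIM = 128         # hypothetical target size, not taken from the paper


def concat_features(roberta_vec, gpt2_vec):
    """Join the two question embeddings end to end."""
    return list(roberta_vec) + list(gpt2_vec)


def make_projection(in_dim, out_dim, seed=0):
    """Random stand-in for a learned projection matrix (out_dim rows, in_dim columns)."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def reduce_dim(features, projection):
    """Assumed reduction: a linear projection, one dot product per output dimension."""
    return [sum(w * x for w, x in zip(row, features)) for row in projection]


# Random vectors stand in for the pooled outputs of the two feature extractors.
rng = random.Random(1)
roberta_vec = [rng.gauss(0.0, 1.0) for _ in range(ROBERTA_LARGE_DIM)]
gpt2_vec = [rng.gauss(0.0, 1.0) for _ in range(GPT2_XL_DIM)]

fused = concat_features(roberta_vec, gpt2_vec)
projection = make_projection(len(fused), REDUCED_DIM)
reduced = reduce_dim(fused, projection)

print(len(fused), len(reduced))  # 2624 128
```

In this sketch the fused vector has 1024 + 1600 = 2624 dimensions, so any layer consuming it directly would need 2624 input weights per unit. Projecting down first is one way the reduction step could lower computation and inference time downstream.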

