Document-based Visual Question Answering is a challenging task at the intersection of linguistic sense disambiguation and fine-grained multimodal retrieval. Although the use of large language models and open-world prior models\cite{1} has brought encouraging progress in document-based question answering, several challenges persist, including slow response times, long inference times, and imprecise matching. To address these challenges, we propose Jaeger, a concatenation-based multi-transformer VQA model. To derive question features, we use RoBERTa large\cite{2} and GPT2-xl\cite{3} as feature extractors and concatenate the outputs of the two models. This operation lets the model consider information from diverse sources concurrently, strengthening its representational capability. Because both extractors are pre-trained, concatenating their features has the potential to improve on the performance of either model alone. After concatenation, we apply dimensionality reduction to the output features, reducing the model's computational cost and inference time. Empirical results demonstrate that our proposed model achieves competitive performance on Task C of the PDF-VQA Dataset.
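As a minimal sketch of the concatenation and reduction steps described above, let $h_R \in \mathbb{R}^{1024}$ and $h_G \in \mathbb{R}^{1600}$ denote pooled question features from RoBERTa large and GPT2-xl, where 1024 and 1600 are the standard hidden sizes of the two models. The pooling strategy, the linear form of the reduction, and the target dimension $d$ are assumptions made for illustration and are not specified in the text; the actual reduction used by the model may differ.
\begin{equation}
z = [\,h_R \,;\, h_G\,] \in \mathbb{R}^{2624}, \qquad \tilde{z} = W z + b, \quad W \in \mathbb{R}^{d \times 2624},\; b \in \mathbb{R}^{d},\; d < 2624 .
\end{equation}
Under these assumptions, downstream layers operate on the $d$-dimensional vector $\tilde{z}$ rather than the 2624-dimensional vector $z$, which is where the savings in computation and inference time would come from.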