Document-based Visual Question Answering (VQA) is a challenging task that sits between linguistic sense disambiguation and fine-grained multimodal retrieval. Although large language models and open-world prior models\cite{1} have driven encouraging progress in document-based question answering, several challenges persist, including slow response, long inference times, and imprecise matching. To address these challenges, we propose Jaegar, a concatenation-based multi-transformer VQA model. To derive question features, we use RoBERTa large\cite{2} and GPT2-xl\cite{3} as feature extractors and concatenate their outputs. Concatenation lets the model consider information from both sources at once, strengthening its representational capability; because both extractors are pre-trained, combining them has the potential to improve on what either model achieves alone. After concatenation, we apply dimensionality reduction to the output features, reducing the model's computational cost and inference time. Empirical results demonstrate that our proposed model achieves competitive performance on Task C of the PDF-VQA Dataset.
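The concatenate-then-reduce step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which dimensionality reduction is used, so a plain linear projection with random weights stands in for it, and the function names, the placeholder feature vectors, and the target size of 64 are all assumptions made for illustration. The only figures taken as given are the hidden sizes of the two extractors (1024 for RoBERTa large, 1600 for GPT2-xl).

```python
import random

ROBERTA_LARGE_DIM = 1024  # hidden size of RoBERTa large
GPT2_XL_DIM = 1600        # hidden size of GPT2-xl


def concat_features(roberta_vec, gpt2_vec):
    """Concatenate two per-question feature vectors end to end."""
    return list(roberta_vec) + list(gpt2_vec)


def make_projection(in_dim, out_dim, seed=0):
    """Build a random (out_dim x in_dim) matrix; a stand-in for a learned layer."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def reduce_dim(features, weights):
    """Apply a linear projection: one dot product per output dimension."""
    if any(len(row) != len(features) for row in weights):
        raise ValueError("weight rows must match the feature length")
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


# Placeholder vectors standing in for pooled extractor outputs (assumption:
# each extractor yields one fixed-size vector per question).
rng = random.Random(42)
roberta_vec = [rng.random() for _ in range(ROBERTA_LARGE_DIM)]
gpt2_vec = [rng.random() for _ in range(GPT2_XL_DIM)]

fused = concat_features(roberta_vec, gpt2_vec)
weights = make_projection(len(fused), out_dim=64)  # 64 is an arbitrary target size
reduced = reduce_dim(fused, weights)

print(len(fused), len(reduced))  # 2624 64
```

The fused vector has 1024 + 1600 = 2624 dimensions, so every downstream layer would otherwise operate on that width; projecting to a smaller size is what cuts the per-question compute after fusion.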