VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA

Real-world documents combine text with tables, charts, photographs, and diagrams arranged in diverse layouts, yet existing research on multimodal large language models (MLLMs) for document QA predominantly produces text-only responses, underutilizing these visual elements. We introduce VinQA, a dataset for long-form answer generation where cited visual elements are explicitly interleaved with their supporting text and grounded in relevant document pages. To support this task, we study two encoding methods for feeding raw document page images into an MLLM, along with their visual-element citation mechanisms: (1) Page Encoding, which directly encodes full-page images with bounding boxes of visual elements and treats these boxed regions as citable units; and (2) Modality Encoding, which parses each page to extract text and crop visual elements, encodes them separately, and uses these cropped elements as citable units. In our experiments, we propose M-GroSE, a multimodal evaluation framework extending GroUSE to assess answers along four dimensions: completeness, answer relevancy, faithfulness, and unanswerability. We additionally report Visual Source F1 to directly measure visual citation accuracy. Although proprietary frontier models still achieve the best overall scores on the VinQA test split, fine-tuning open Qwen2.5-VL models on the training split substantially improves their performance and narrows this gap. Modality Encoding is initially more robust for complex documents with long text, many visual elements, and diverse citation requirements. After training on VinQA, however, Page Encoding reaches a comparable level, competing effectively even without the explicit parsing used in Modality Encoding. Finally, Visual G-Eval, an MLLM-based judge, confirms that fine-tuned models insert visual elements at semantically appropriate positions with faithful supporting text.
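
Visual Source F1 is named above but not defined in this text. As a minimal sketch, assume it is a set-based F1 between the visual elements a model cites in its answer and the reference visual elements for that question. The function name `visual_source_f1`, the string identifiers, and the handling of empty citation sets are illustrative assumptions, not the authors' implementation.

```python
from typing import Iterable


def visual_source_f1(predicted_ids: Iterable[str], gold_ids: Iterable[str]) -> float:
    """Set-based F1 between cited and reference visual-element IDs.

    Assumption: each visual element (table, chart, photo, diagram) has a
    unique identifier, and repeated citations of the same element count once.
    """
    predicted = set(predicted_ids)
    gold = set(gold_ids)

    # Assumption: if neither the answer nor the reference cites any visual
    # element, treat the prediction as fully correct.
    if not predicted and not gold:
        return 1.0

    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0

    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one of two cited elements matches the reference.
score = visual_source_f1(["fig1", "tab2"], ["fig1", "fig3"])
```

Under this reading, precision penalizes citing irrelevant visual elements and recall penalizes omitting relevant ones, so a model cannot score well either by citing everything on the page or by citing nothing.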
