Visual question answering (VQA) has emerged as a flexible approach for extracting specific pieces of information from document images. However, existing work typically queries each field in isolation, overlooking potential dependencies across multiple items. This paper investigates the merits of extracting multiple fields jointly versus separately. Through experiments on multiple large vision-language models and datasets, we show that jointly extracting fields often improves accuracy, especially when the fields share strong numeric or contextual dependencies. We further analyze how performance scales with the number of requested items and use a regression-based metric to quantify inter-field relationships. Our results suggest that multi-field prompts can mitigate confusion arising from similar surface forms and related numeric values, providing practical methods for designing robust VQA systems for document information extraction tasks.