Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering

Existing multimodal document question answering methods commonly adopt a supply-side ingestion strategy: they run a Vision-Language Model (VLM) on every page during indexing to generate comprehensive descriptions, then answer questions through text retrieval. This "pre-ingestion" approach has three drawbacks. It is costly (a 113-page engineering drawing package requires approximately 80,000 VLM tokens), it is unreliable end to end (VLM outputs may fail to be retrieved correctly because of format mismatches in the retrieval infrastructure), and it cannot recover once it fails. This paper proposes the Deferred Visual Ingestion (DVI) framework, which adopts a demand-side ingestion strategy: the indexing phase performs only lightweight metadata extraction, and visual understanding is deferred until the user poses a specific question. DVI's core principle is "index for locating, not understanding": pages are localized through structured metadata indexes and BM25 full-text search, and the original page images are then sent to a VLM together with the specific question for targeted analysis. Experiments on two real industrial engineering drawing packages (113 pages and 7 pages) show that DVI achieves comparable overall accuracy with zero VLM cost at ingestion (46.7% vs. 48.9% for pre-ingestion), an effectiveness rate of 50% on visually necessary queries (vs. 0% for pre-ingestion), and 100% page localization (98% search space compression). DVI also supports interactive refinement and progressive caching, which turns the "QA accuracy" problem into a "page localization" problem: once the correct drawing page is found, obtaining the answer becomes a matter of interaction rounds.
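The locate-then-analyze flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `BM25PageIndex`, the helper `build_vlm_request`, the sample page texts, the image paths, and the parameter values `k1=1.5` and `b=0.75` are all assumptions made for illustration. The sketch indexes only lightweight per-page text with a standard BM25 scoring formula, localizes candidate pages for a question, and packages the original page images with the question for a VLM call, which is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase alphanumeric tokens; a real system would likely need a richer tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25PageIndex:
    """Hypothetical page-level index over lightweight metadata text (no VLM at indexing time)."""

    def __init__(self, pages, k1=1.5, b=0.75):
        # pages: dict mapping page id -> lightweight text (title block, sheet name, OCR text, ...)
        self.k1, self.b = k1, b
        self.page_ids = list(pages)
        self.term_freqs = [Counter(tokenize(pages[p])) for p in self.page_ids]
        self.lengths = [sum(tf.values()) for tf in self.term_freqs]
        self.avg_len = sum(self.lengths) / len(self.lengths)
        # Document frequency: number of pages containing each term.
        self.doc_freq = Counter()
        for tf in self.term_freqs:
            self.doc_freq.update(tf.keys())

    def idf(self, term):
        n = len(self.page_ids)
        df = self.doc_freq.get(term, 0)
        return math.log(1 + (n - df + 0.5) / (df + 0.5))

    def score(self, query_terms, i):
        tf, length = self.term_freqs[i], self.lengths[i]
        total = 0.0
        for term in query_terms:
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf(term) * f * (self.k1 + 1) / norm
        return total

    def locate(self, query, top_k=2):
        # Return up to top_k page ids with a positive BM25 score, best first.
        terms = tokenize(query)
        scored = [(self.score(terms, i), pid) for i, pid in enumerate(self.page_ids)]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: -s[0])
        return [pid for _, pid in scored[:top_k]]


def build_vlm_request(question, page_ids, image_paths):
    # Hypothetical payload: the original page images plus the specific question.
    # Sending it to an actual VLM is deliberately omitted.
    return {"question": question, "images": [image_paths[p] for p in page_ids]}


# Illustrative data only; not taken from the paper's drawing packages.
pages = {
    1: "cover sheet project title drawing list",
    2: "hydraulic schematic pump station valve manifold",
    3: "electrical wiring diagram motor control panel",
    4: "hydraulic cylinder assembly detail dimensions",
}
image_paths = {pid: f"pages/page_{pid:03d}.png" for pid in pages}

index = BM25PageIndex(pages)
question = "What is the valve arrangement at the pump station"
candidates = index.locate(question)
request = build_vlm_request(question, candidates, image_paths)
```

In this sketch the VLM is consulted only at question time and only for the localized pages, which is the demand-side behavior the abstract describes. Caching of answers and interactive refinement are not modeled.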
