Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and performing multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts, generating reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, recovering correct answers while minimizing computational cost. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict model. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict
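
The consensus expert selection step can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes agreement is measured over the draft experts' final answers via simple majority voting, with a hypothetical `threshold` parameter, and only the indices of high-agreement paths are forwarded to the verdict model.

```python
from collections import Counter

def consensus_select(draft_answers, threshold=0.5):
    """Keep only the draft reasoning paths whose final answer reaches the
    agreement threshold; if no answer does, forward all paths so the
    verdict model can still synthesize across disagreeing drafts."""
    counts = Counter(draft_answers)
    top_answer, top_count = counts.most_common(1)[0]
    if top_count / len(draft_answers) >= threshold:
        return [i for i, a in enumerate(draft_answers) if a == top_answer]
    return list(range(len(draft_answers)))  # no consensus: keep everything

# Example: three draft experts, two agree on "42"
print(consensus_select(["42", "42", "7"]))   # [0, 1]
print(consensus_select(["a", "b", "c"]))     # [0, 1, 2] (no majority)
```

In a full pipeline, the selected paths (reasoning traces, not just answers) would be concatenated into the verdict model's prompt, so the strong VLM reasons over pre-filtered evidence rather than all drafts.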