Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and performing multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts, generating reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, recovering correct answers while minimizing computational cost. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict model. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict
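
The consensus expert selection step can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes agreement is measured over the draft experts' final answers via simple majority voting, with a hypothetical `threshold` parameter, and only the indices of high-agreement paths are forwarded to the verdict model.

```python
from collections import Counter

def consensus_select(draft_answers, threshold=0.5):
    """Keep only the draft reasoning paths whose final answer reaches the
    agreement threshold; if no answer does, forward all paths so the
    verdict model can still synthesize across disagreeing drafts."""
    counts = Counter(draft_answers)
    top_answer, top_count = counts.most_common(1)[0]
    if top_count / len(draft_answers) >= threshold:
        return [i for i, a in enumerate(draft_answers) if a == top_answer]
    return list(range(len(draft_answers)))  # no consensus: keep everything

# Example: three draft experts, two agree on "42"
print(consensus_select(["42", "42", "7"]))   # [0, 1]
print(consensus_select(["a", "b", "c"]))     # [0, 1, 2] (no majority)
```

In a full pipeline, the selected paths (reasoning traces, not just answers) would be concatenated into the verdict model's prompt, so the strong VLM reasons over pre-filtered evidence rather than all drafts.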