Multimodal large language models (MLLMs) achieve ever-stronger performance on vision-language tasks. Yet even as traditional visual question answering benchmarks approach saturation, reliable deployment requires meeting low error tolerances in real-world out-of-distribution (OOD) scenarios. Selective prediction addresses this need: it aims to maximize coverage, i.e. the share of inputs the system answers, while adhering to a user-defined risk level. This is typically achieved by assigning a confidence score to each answer and abstaining on answers whose score falls below a threshold. To enable reliable generalization, we require reasoner models to produce localized visual evidence while answering, and we design a selector that explicitly learns to estimate the quality of the localization provided by the reasoner. We show that SIEVES (Selective Prediction through Visual Evidence Scoring) improves coverage by up to three times over non-grounding baselines on challenging OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA). Beyond better generalization to OOD tasks, the design of the SIEVES selector enables transfer to proprietary reasoners, such as o3 and Gemini-3-Pro, without access to their weights or logits, and yields coverage gains beyond those attributable to accuracy alone. SIEVES generalizes across all five tested OOD datasets and all three reasoner models (Pixel-Reasoner, o3, and Gemini-3-Pro) without benchmark- or reasoner-specific training or adaptation.
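To make the coverage-at-risk notion concrete, the following is a minimal sketch of how coverage under a user-defined risk level could be computed from per-answer confidence scores. It is not the SIEVES selector itself: the function name `coverage_at_risk` and its parameters are hypothetical, and the confidence scores are assumed to come from any selector. The sketch picks the confidence threshold that answers as many inputs as possible while keeping the error rate on answered inputs at or below the risk level.

```python
from typing import Sequence


def coverage_at_risk(
    confidences: Sequence[float],
    correct: Sequence[bool],
    max_risk: float,
) -> float:
    """Return the largest fraction of inputs that can be answered such that
    the error rate among answered inputs is at most `max_risk`.

    Hypothetical helper for illustration; names are not from the paper.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Answer the most confident inputs first.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)

    errors = 0
    best_k = 0
    for k, i in enumerate(order, start=1):
        if not correct[i]:
            errors += 1
        # A threshold cannot separate inputs with identical confidence,
        # so only evaluate risk at the end of each group of ties.
        at_tie_boundary = k == n or confidences[order[k]] != confidences[i]
        if at_tie_boundary and errors / k <= max_risk:
            best_k = k
    return best_k / n


# Toy data: five answers with their confidence and correctness.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
is_correct = [True, True, False, True, False]
print(coverage_at_risk(scores, is_correct, max_risk=0.25))
```

On this toy data, answering the four most confident inputs yields one error in four answers, which meets a 25% risk level, whereas answering all five does not. A better selector raises this number by ranking correct answers above incorrect ones, independently of the reasoner's accuracy.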