Pix2Fact: When Vision Is Not Enough -- Benchmarking Fine-Grained VQA with Web Verification on High-Resolution Real-World Scenes

Despite progress on general tasks, vision-language models (VLMs) still struggle with problems that demand both fine-grained visual grounding and external knowledge, a combination overlooked by existing benchmarks, which evaluate these abilities in isolation. To fill this gap, we introduce Pix2Fact, a visual question-answering benchmark designed to assess expert-level visual perception and knowledge search. Pix2Fact comprises 1,000 high-resolution (4K+) images spanning eight scenarios. Its questions and answers are crafted by PhD-holding annotators from top global universities across diverse disciplines, and each question requires detailed visual grounding and the integration of external knowledge. Evaluating ten state-of-the-art VLMs, including proprietary models such as Gemini-3.1-Pro and GPT-5.4, we find that Pix2Fact poses a formidable challenge: the most advanced model (Gemini-3.1-Pro) achieves only 51.7% average accuracy, even with access to visual ground truth and search tools. Our analysis attributes this low accuracy to three factors: frequent visual grounding errors even when visual ground truth is provided, shallow use of search tools, and VLMs' inability to retrieve long-tail, unstructured local information. This gap exposes the limitations of current models in assisting humans with real-world scenarios that demand intensive visual comprehension. We believe Pix2Fact will serve as a critical benchmark to drive the next generation of vision-language agents that seamlessly integrate fine-grained perception with robust knowledge search.
