Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world questions can only be answered correctly with external knowledge that is not directly observable in the image. We introduce WikiVQABench, a human-curated, knowledge-grounded VQA benchmark constructed by systematically combining Wikipedia images, their associated article captions, and structured knowledge from Wikidata. Our pipeline uses large language models (LLMs) to generate candidate multiple-choice image-question-answer sets. Human annotators then review and curate every generated instance to ensure factual correctness, visual-text consistency, and a genuine need for external knowledge in addition to visual evidence. WikiVQABench comprises a substantial collection of Wikipedia images with curated multiple-choice questions designed to benchmark knowledge-aware vision-language models (VLMs). An evaluation of fifteen VLMs (256M to 90B parameters) reveals a wide accuracy range (24.7% to 75.6%), indicating that the benchmark effectively discriminates among models on knowledge-intensive reasoning. The dataset and benchmarking code are publicly available.