Vision-extended LLMs (VLLMs) have made significant strides in Visual Question Answering (VQA). Despite these advances, VLLMs still struggle with queries involving long-tail entities and tend to produce erroneous or hallucinated responses. In this work, we introduce a new evaluation benchmark named \textbf{SnapNTell}, specifically tailored for entity-centric VQA. The task tests a model's ability to identify entities and provide detailed, entity-specific knowledge. We have developed the \textbf{SnapNTell Dataset}, which differs from traditional VQA datasets in two ways: (1) it encompasses a wide range of categorized entities, each represented by images and explicitly named in the answers; (2) it features QA pairs that require extensive knowledge to answer accurately. The dataset is organized into 22 major categories containing 7,568 unique entities in total. For each entity, we curated 10 illustrative images and crafted 10 knowledge-intensive QA pairs. To address this task, we devised a scalable, efficient, and transparent retrieval-augmented multimodal LLM. Our approach markedly outperforms existing methods on the SnapNTell dataset, achieving a 66.5\% improvement in the BLEURT score. We will soon make the dataset and the source code publicly accessible.
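The dataset figures stated above imply the overall scale, and the following minimal sketch works through that arithmetic. The record layout (\texttt{EntityRecord} and its field names) and the helper \texttt{dataset\_totals} are hypothetical, introduced only for illustration; they are not the released schema or code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Figures taken from the abstract.
NUM_CATEGORIES = 22
NUM_ENTITIES = 7_568
IMAGES_PER_ENTITY = 10
QA_PAIRS_PER_ENTITY = 10


@dataclass
class EntityRecord:
    """Hypothetical layout of one entity entry (assumed, not the released schema)."""
    category: str                 # one of the 22 major categories
    name: str                     # entity name, stated explicitly in each answer
    image_paths: List[str] = field(default_factory=list)
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)  # (question, answer)


def dataset_totals(
    num_entities: int = NUM_ENTITIES,
    images_per_entity: int = IMAGES_PER_ENTITY,
    qa_per_entity: int = QA_PAIRS_PER_ENTITY,
) -> Dict[str, int]:
    """Total image and QA-pair counts implied by the per-entity figures."""
    return {
        "images": num_entities * images_per_entity,
        "qa_pairs": num_entities * qa_per_entity,
    }


totals = dataset_totals()
print(totals)
```

Under these figures, the dataset would contain 75,680 images and 75,680 QA pairs, assuming every entity has exactly 10 of each.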