Vision-language models (VLMs) excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition ("Is it a cat or a dog?") over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. TWIN is a large-scale dataset of 561,000 image-pair queries that task models with determining whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. The dataset spans a diverse range of everyday objects across contexts, viewpoints, and appearances. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks. To quantify these gains, we introduce FGVQA, a benchmark suite of 12,000 queries that repurposes fine-grained recognition and retrieval datasets from multiple domains. Existing VLMs struggle on FGVQA, but after fine-tuning on TWIN they improve by up to 19.3%, without compromising performance on general VQA benchmarks. Finally, the TWIN dataset scales favorably with the number of object annotations, and our analysis shows that this scale is key to performance. We envision TWIN as a drop-in addition to open-source VLM training corpora, advancing the perceptual precision of future models. Project webpage: https://glab-caltech.github.io/twin/