Instance-level recognition (ILR) concerns distinguishing individual instances from one another, with person re-identification as a prominent example. Despite the impressive visual perception capabilities of modern VLMs, we find their performance on ILR unsatisfactory, often dramatically underperforming domain-specific ILR models. This limitation hinders many practical applications of VLMs, e.g., scenarios where recognizing familiar people and objects is crucial for effective visual understanding. Existing solutions typically learn to recognize instances one at a time using instance-specific datasets, which not only incurs substantial data collection and training costs but also struggles with fine-grained discrimination. In this work, we propose IIR-VLM, a VLM enhanced for In-context Instance-level Recognition. We integrate pre-trained ILR expert models as auxiliary visual encoders that provide specialized features for learning diverse instances, enabling the VLM to learn new instances in context in a one-shot manner. IIR-VLM then leverages this knowledge for instance-aware visual understanding. We validate IIR-VLM's efficacy on existing instance-personalization benchmarks. Finally, we demonstrate its superior ILR performance on a challenging new benchmark that assesses ILR capabilities across varying difficulty levels and diverse categories, with persons, faces, pets, and general objects as the instances at hand.
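The core idea above — enrolling a new instance from a single reference image via a specialized ILR encoder, then recognizing it in later queries by feature similarity — can be sketched as follows. This is a minimal illustration only: the function names, the flat feature vectors standing in for expert-encoder outputs, and the cosine-similarity matching rule are assumptions for exposition, not the paper's actual fusion mechanism between the expert encoders and the VLM.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Normalize a feature vector (or stack of vectors) to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def one_shot_match(query_feat, reference_feats, names, threshold=0.5):
    """Match a query's expert embedding against one-shot enrolled references.

    query_feat      : embedding of the query image (hypothetical ILR expert output)
    reference_feats : one embedding per enrolled instance (one reference image each)
    names           : instance labels, parallel to reference_feats
    threshold       : minimum cosine similarity to accept a match
    """
    q = l2_normalize(np.asarray(query_feat, dtype=np.float64))
    refs = l2_normalize(np.stack([np.asarray(r, dtype=np.float64)
                                  for r in reference_feats]))
    sims = refs @ q  # cosine similarity against every enrolled instance
    best = int(np.argmax(sims))
    if sims[best] >= threshold:
        return names[best], float(sims[best])
    return None, float(sims[best])  # unknown instance
```

In a full system, the returned label would condition the VLM's response (instance-aware visual understanding) rather than being the final output; the threshold controls the trade-off between rejecting unknown instances and missing enrolled ones.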