In this study, we define and tackle zero-shot "real" classification by description, a novel task that evaluates the ability of Vision-Language Models (VLMs) such as CLIP to classify objects based solely on descriptive attributes, excluding object class names. This setting exposes the current limitations of VLMs in understanding intricate object descriptions, pushing these models beyond mere object recognition. To facilitate this exploration, we introduce a new challenge and release description data for six popular fine-grained benchmarks, omitting object names to encourage genuine zero-shot learning within the research community. Additionally, we propose a method to enhance CLIP's attribute-detection capabilities through targeted training on ImageNet21k's diverse object categories, paired with rich attribute descriptions generated by large language models. Furthermore, we introduce a modified CLIP architecture that leverages multiple resolutions to improve the detection of fine-grained part attributes. Through these efforts, we broaden the understanding of part-attribute recognition in CLIP, improving its performance on fine-grained classification across six popular benchmarks, as well as on PACO, a widely used benchmark for object-attribute recognition. Code is available at: https://github.com/ethanbar11/grounding_ge_public.
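The description-based zero-shot setup can be sketched as follows. This is a minimal illustration with placeholder embeddings, not the paper's implementation: given an image embedding (e.g. from a CLIP-style image encoder) and text embeddings of attribute-only descriptions (class names excluded), each class is scored by the mean cosine similarity between the image and its descriptions, and the highest-scoring class is predicted.

```python
import numpy as np

def classify_by_description(image_emb, class_descriptions):
    """Pick the class whose attribute descriptions best match the image.

    image_emb: (d,) image embedding.
    class_descriptions: dict mapping class name -> (k, d) array of
        embeddings of attribute-only descriptions (no class names).
    Returns the predicted class and the per-class mean cosine scores.
    """
    img = image_emb / np.linalg.norm(image_emb)
    scores = {}
    for cls, desc_embs in class_descriptions.items():
        # L2-normalize each description embedding, then average
        # its cosine similarity with the image embedding.
        d = desc_embs / np.linalg.norm(desc_embs, axis=1, keepdims=True)
        scores[cls] = float((d @ img).mean())
    return max(scores, key=scores.get), scores

# Toy usage with hypothetical 3-d embeddings (real CLIP embeddings
# are typically 512-d or larger):
img = np.array([1.0, 0.0, 0.0])
descs = {
    "sparrow": np.array([[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]),
    "finch":   np.array([[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]),
}
pred, per_class = classify_by_description(img, descs)
```

Averaging over several descriptions per class reflects the idea that no single attribute identifies a fine-grained category; the aggregate of part-level attributes does.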