Zero-shot inference, where pre-trained models perform tasks without specific training data, is an exciting emergent ability of large models like CLIP. Although there has been considerable exploration into enhancing zero-shot abilities in image captioning (IC) for popular datasets such as MSCOCO and Flickr8k, these approaches fall short with fine-grained datasets like CUB, FLO, UCM-Captions, and Sydney-Captions. These datasets require captions to discern between visually and semantically similar classes, focusing on detailed object parts and their attributes. To overcome this challenge, we introduce TRaining-Free Object-Part Enhancement (TROPE). TROPE enriches a base caption with additional object-part details using object detector proposals and Natural Language Processing techniques. It complements rather than alters the base caption, allowing seamless integration with other captioning methods and offering users enhanced flexibility. Our evaluations show that TROPE consistently boosts performance across all tested zero-shot IC approaches and achieves state-of-the-art results on fine-grained IC datasets.
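The core idea of appending detector-derived part details to an unchanged base caption can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name, the `(part, attribute)` proposal format, and the phrase-joining heuristic are all assumptions.

```python
# Hypothetical sketch of TROPE-style caption enrichment.
# Assumes an object detector has already produced (part, attribute) pairs,
# e.g. [("beak", "short"), ("wings", "brown")] for a bird image.

def enrich_caption(base_caption, part_proposals):
    """Append object-part details to a base caption without altering it."""
    if not part_proposals:
        return base_caption
    # Turn each proposal into a short descriptive phrase.
    phrases = [f"{attr} {part}" for part, attr in part_proposals]
    if len(phrases) > 1:
        detail = ", ".join(phrases[:-1]) + " and " + phrases[-1]
    else:
        detail = phrases[0]
    # The base caption is kept intact; details are only appended.
    return f"{base_caption.rstrip('.')} with {detail}."

print(enrich_caption("A small bird perched on a branch",
                     [("beak", "short"), ("wings", "brown")]))
# → A small bird perched on a branch with short beak and brown wings.
```

Because the base caption is never modified, the same enrichment step can be layered on top of any zero-shot captioner's output.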