Benefiting from the generalization capability of CLIP, recent vision language pre-training (VLP) models have demonstrated the ability to capture a wide range of visual concepts in everyday images. However, because open-vocabulary settings contain unseen categories, existing algorithms struggle to capture semantic correlations between categories, leading to suboptimal performance on open-vocabulary multi-label recognition (OV-MLR). Furthermore, the number of discriminative areas varies substantially across object categories, which is misaligned with the fixed-number patch matching used in current methods and introduces noisy visual cues that hinder the capture of target semantics. To address these challenges, we propose a novel category-adaptive cross-modal semantic refinement and transfer (C$^2$SRT) framework that models semantic correlations both within each category and across different categories in a category-adaptive manner. The proposed framework consists of two complementary modules: an intra-category semantic refinement (ISR) module and an inter-category semantic transfer (IST) module. Specifically, the ISR module leverages the cross-modal knowledge of the VLP model to adaptively select a set of local discriminative regions that represent the semantics of the target category. The IST module adaptively discovers a set of correlated categories for a target category by constructing a category-adaptive correlation graph, and transfers semantic knowledge from the correlated seen categories to unseen ones. Experiments on OV-MLR benchmarks demonstrate that the proposed C$^2$SRT framework improves over current methods.
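To make the contrast with fixed-number patch matching concrete, the following is a minimal sketch of the two category-adaptive ideas described above. It is not the C$^2$SRT implementation: the relative-threshold selection rule, the similarity-threshold graph rule, the function names `select_regions` and `correlated_categories`, the parameters `ratio` and `threshold`, and the toy 2-D embeddings are all assumptions chosen for illustration. In practice the embeddings would come from the image and text encoders of a VLP model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_regions(patches, text_emb, ratio=0.8):
    """Indices of patches whose similarity to the category text embedding
    is at least `ratio` times that category's best patch similarity.
    Assumed rule: the number of kept patches depends on the category,
    unlike a fixed top-k selection."""
    sims = [cosine(p, text_emb) for p in patches]
    best = max(sims)
    if best <= 0:
        return []
    return [i for i, s in enumerate(sims) if s >= ratio * best]

def correlated_categories(text_embs, target, threshold=0.5):
    """Categories whose text embedding is similar enough to the target's.
    Assumed rule: each category gets its own neighbour set, possibly empty,
    rather than a fixed number of neighbours."""
    anchor = text_embs[target]
    return [name for name, emb in text_embs.items()
            if name != target and cosine(emb, anchor) >= threshold]

# Toy data (hypothetical): five patch embeddings and three category embeddings.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
text_embs = {"dog": [1.0, 0.0], "cat": [0.9, 0.3], "car": [0.0, 1.0]}

dog_regions = select_regions(patches, text_embs["dog"])
car_regions = select_regions(patches, text_embs["car"])
dog_neighbours = correlated_categories(text_embs, "dog")
car_neighbours = correlated_categories(text_embs, "car")
```

On this toy input, "dog" keeps two patches while "car" keeps three, and "dog" is linked to "cat" while "car" has no neighbour above the threshold, which illustrates how both the region set and the correlation graph can vary per category.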