Cross-Domain Few-Shot Learning (CDFSL) adapts models trained on large-scale general data (the source domain) to downstream target domains with only scarce training data; research on vision-language models (e.g., CLIP) in this setting is still in its early stages. Typical downstream domains, such as medical diagnosis, require fine-grained visual cues for interpretable recognition, but we find that current fine-tuned CLIP models can hardly focus on these cues, although they can roughly focus on important regions in source domains. Prior work has demonstrated CLIP's shortcomings in capturing subtle local patterns; in this paper, we find that the domain gap and scarce training data further exacerbate these shortcomings, and do so much more for local patterns than for holistic ones. We call this the local misalignment problem in CLIP-based CDFSL. Because no supervision is available for aligning local visual features with text semantics, we address this problem with self-supervision. Inspired by the translation task, we propose CC-CDFSL, a method based on cycle consistency: it translates local visual features into text features and then back into visual features (and vice versa), and constrains the original features to stay close to the back-translated features. To reduce the noise introduced by the richer information in the visual modality, we further propose a Semantic Anchor mechanism, which first augments visual features to provide a larger corpus for the text-to-image mapping, and then shrinks the image features to filter out irrelevant image-to-text mappings. Extensive experiments on various benchmarks, backbones, and fine-tuning methods show that our method (1) effectively improves local vision-language alignment, (2) enhances the interpretability of learned patterns and model decisions, as seen through patch visualization, and (3) achieves state-of-the-art performance.
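The cycle-consistency idea described above can be illustrated with a minimal sketch. This is not the CC-CDFSL implementation: the abstract does not specify how features are translated between modalities, so the sketch assumes a soft, attention-style translation (each feature is re-expressed as a similarity-weighted mix of the other modality's features) and a cosine-distance penalty. The function names (`translate`, `cycle_consistency_loss`), the `temperature` value, and the feature shapes are all hypothetical, and the Semantic Anchor mechanism is omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def translate(queries, corpus, temperature=0.07):
    """Assumed soft 'translation': re-express each query row as an
    attention-weighted combination of the corpus rows."""
    sim = l2_normalize(queries) @ l2_normalize(corpus).T
    attn = softmax(sim / temperature, axis=-1)
    return attn @ corpus


def cosine_distance(a, b):
    """Row-wise 1 - cosine similarity."""
    return 1.0 - (l2_normalize(a) * l2_normalize(b)).sum(axis=-1)


def cycle_consistency_loss(patches, texts, temperature=0.07):
    """Penalize the gap between original features and their
    back-translated versions, in both directions (assumed form)."""
    # visual -> text -> visual
    patches_as_text = translate(patches, texts, temperature)
    patches_back = translate(patches_as_text, patches, temperature)
    # text -> visual -> text
    texts_as_visual = translate(texts, patches, temperature)
    texts_back = translate(texts_as_visual, texts, temperature)
    loss_visual = cosine_distance(patches, patches_back).mean()
    loss_text = cosine_distance(texts, texts_back).mean()
    return float(loss_visual + loss_text)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(49, 16))  # e.g. 7x7 local patch features (hypothetical shape)
    texts = rng.normal(size=(5, 16))     # e.g. one text feature per class (hypothetical shape)
    print(cycle_consistency_loss(patches, texts))
```

In this sketch the loss is small when every patch maps to a text feature that maps back to the same patch, which is the self-supervised signal the abstract appeals to; in practice such a term would be computed on differentiable encoder outputs and added to the fine-tuning objective.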