Foundation models such as CLIP have proven effective for few-shot learning, but their cross-domain robustness remains limited in this setting. Recent works add text as an extra modality to improve the performance of these models. Most of these approaches, however, treat text as an auxiliary modality and do not fully explore its potential to describe the underlying distribution of each class's visual features. In this paper, we present a novel approach that leverages text-derived statistics to predict the mean and covariance of the visual feature distribution for each class. This predictive framework enriches the latent space, yielding more robust and generalizable few-shot learning models. We demonstrate that incorporating both the mean and the covariance statistics improves few-shot classification performance across various datasets. Our results show that text can be used to predict the mean and covariance of the class distribution, offering promising improvements in few-shot learning scenarios.
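To make the role of the two statistics concrete, the following is a minimal sketch of how a per-class mean and covariance could be used to classify a visual feature by Gaussian log-likelihood. It is an illustration under stated assumptions, not the method of this paper: the abstract does not specify how the statistics are predicted from text or how they enter the classifier, so the means and covariances below are hand-written toy values standing in for text-predicted ones, and the names `gaussian_log_likelihood` and `classify` are hypothetical.

```python
import numpy as np


def gaussian_log_likelihood(x, mean, cov):
    """Log-density of x under N(mean, cov), computed via a Cholesky factor."""
    d = mean.shape[0]
    chol = np.linalg.cholesky(cov)            # cov = chol @ chol.T
    z = np.linalg.solve(chol, x - mean)       # whitened residual
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (z @ z + log_det + d * np.log(2.0 * np.pi))


def classify(x, means, covs):
    """Return the index of the class whose Gaussian gives x the highest log-likelihood."""
    scores = [gaussian_log_likelihood(x, m, c) for m, c in zip(means, covs)]
    return int(np.argmax(scores))


# Toy 2-D stand-ins for text-predicted statistics (assumed values, not real data).
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]

query = np.array([2.5, 2.8])                  # a hypothetical visual feature
predicted_class = classify(query, means, covs)
print(predicted_class)
```

Using the covariance as well as the mean lets the decision account for the shape and spread of each class, rather than only the distance to a class prototype.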