Most existing studies on few-shot learning focus on unimodal settings, where models are trained to generalize to unseen data using only a small number of labeled examples from the same modality. However, real-world data are inherently multi-modal, and unimodal approaches limit the practical applications of few-shot learning. To address this gap, this paper introduces the Cross-modal Few-Shot Learning (CFSL) task, which aims to recognize instances from multiple modalities when only a few labeled examples are available. This task presents additional challenges compared to classical few-shot learning due to the distinct visual characteristics and structural properties unique to each modality. To tackle these challenges, we propose a Generative Transfer Learning (GTL) framework consisting of two stages: the first stage involves training on abundant unimodal data, and the second stage focuses on transfer learning to adapt to novel data. Our GTL framework jointly estimates the latent concept shared across modalities and the in-modality disturbance in both stages, while freezing the generative module during the transfer phase to maintain the stability of the learned representations and prevent overfitting to the limited multi-modal samples. Our findings demonstrate that GTL achieves superior performance compared to state-of-the-art methods across four distinct multi-modal datasets: Sketchy, TU-Berlin, Mask1K, and SKSF-A. Additionally, the results suggest that the model can estimate latent concepts from vast unimodal data and generalize these concepts to unseen modalities using only a limited number of available samples, much like human cognitive processes.