The ability to quickly learn a new task with minimal instruction, known as few-shot learning, is a central aspect of intelligent agents. Classical few-shot benchmarks draw their few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better ${\bf visual}$ dog classifier by ${\bf read}$ing about dogs and ${\bf listen}$ing to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP learn cross-modal encoders that map different modalities to the same representation space. Specifically, we propose a simple strategy for ${\bf cross-modal}$ ${\bf adaptation}$: we treat examples from different modalities as additional few-shot examples. For example, by simply repurposing class names as additional training samples, we trivially turn any $n$-shot learning problem into an $(n+1)$-shot problem. This allows us to produce SOTA results with embarrassingly simple linear classifiers. We show that our approach can be combined with existing methods such as prefix tuning, adapters, and classifier ensembling. Finally, to explore other modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification.
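The recipe described above, pooling text embeddings of class names with few-shot image embeddings and fitting a linear classifier on the union, can be sketched in a few lines. The following is a minimal illustration rather than the paper's exact implementation: the two-class label set, the dummy support images, the "a photo of a {}" prompt template, and the scikit-learn logistic-regression head are all illustrative assumptions.

```python
# A minimal sketch of cross-modal adaptation with CLIP: class names are
# repurposed as extra one-shot training examples alongside the few-shot
# images, and a simple linear classifier is fit on the pooled set.
import clip                                   # OpenAI CLIP package
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["dog", "cat"]                  # hypothetical 2-way task
# Dummy stand-ins for the n-shot support images (2 shots per class here).
few_shot_images = [Image.new("RGB", (224, 224)) for _ in range(4)]
few_shot_labels = [0, 0, 1, 1]

with torch.no_grad():
    # Encode the n-shot images into CLIP's shared embedding space.
    image_batch = torch.stack([preprocess(im) for im in few_shot_images]).to(device)
    image_feats = model.encode_image(image_batch)

    # Repurpose each class name as one additional training sample,
    # turning the n-shot problem into an (n+1)-shot problem.
    prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    text_feats = model.encode_text(prompts)

# L2-normalize so image and text features lie on the same unit sphere.
image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# Pool both modalities into one training set and fit a linear classifier.
X = torch.cat([image_feats, text_feats]).cpu().float().numpy()
y = few_shot_labels + list(range(len(class_names)))

classifier = LogisticRegression(max_iter=1000).fit(X, y)
```

At test time one would encode and L2-normalize query images the same way and call `classifier.predict`; because image and text features share CLIP's representation space, the text-derived samples act as genuine extra shots rather than a separate branch.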