The ability to quickly learn a new task with minimal instruction, known as few-shot learning, is a central aspect of intelligent agents. Classical few-shot benchmarks use few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better **visual** dog classifier by **reading** about dogs and **listening** to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP are inherently cross-modal, mapping different modalities to the same representation space. Specifically, we propose a simple cross-modal adaptation approach that learns from few-shot examples spanning different modalities. By repurposing class names as additional one-shot training samples, we achieve state-of-the-art results with an embarrassingly simple linear classifier for vision-language adaptation. Furthermore, we show that our approach can benefit existing methods such as prefix tuning, adapters, and classifier ensembling. Finally, to explore modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification.
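The core idea, treating each class name as one extra labeled training sample in the shared embedding space, can be illustrated with a minimal sketch. This is not the authors' implementation: random synthetic vectors stand in for CLIP image and text embeddings, the helper names (`l2_normalize`, `cross_modal_training_set`, `train_linear_classifier`) are hypothetical, and plain softmax regression trained by full-batch gradient descent is an assumed choice of linear classifier.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length, as is common for CLIP-style embeddings."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def cross_modal_training_set(image_feats, image_labels, text_feats):
    """Append one text embedding per class as an additional labeled sample.

    Assumes text_feats[k] is the embedding of the name of class k.
    """
    feats = np.concatenate([image_feats, text_feats], axis=0)
    labels = np.concatenate([image_labels, np.arange(len(text_feats))])
    return l2_normalize(feats), labels


def train_linear_classifier(feats, labels, num_classes, lr=0.5, epochs=200, seed=0):
    """Softmax regression with full-batch gradient descent (assumed recipe)."""
    rng = np.random.default_rng(seed)
    weights = 0.01 * rng.standard_normal((feats.shape[1], num_classes))
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = feats @ weights
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        grad = feats.T @ (probs - onehot) / len(labels)
        weights = weights - lr * grad
    return weights


# Synthetic stand-ins for embeddings in a shared space (NOT real CLIP features).
rng = np.random.default_rng(0)
num_classes, shots, dim = 5, 4, 64
prototypes = l2_normalize(rng.standard_normal((num_classes, dim)))
image_labels = np.repeat(np.arange(num_classes), shots)
image_feats = prototypes[image_labels] + 0.05 * rng.standard_normal((num_classes * shots, dim))
text_feats = prototypes + 0.05 * rng.standard_normal((num_classes, dim))

feats, labels = cross_modal_training_set(image_feats, image_labels, text_feats)
weights = train_linear_classifier(feats, labels, num_classes)
predictions = (feats @ weights).argmax(axis=1)
print("training accuracy:", (predictions == labels).mean())
```

With real data, `image_feats` and `text_feats` would come from the image and text encoders of a model such as CLIP. The only change relative to uni-modal few-shot training is the extra rows contributed by `text_feats`.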