The ability to quickly learn a new task with minimal instruction, known as few-shot learning, is a central aspect of intelligent agents. Classical few-shot benchmarks use few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better ${\bf visual}$ dog classifier by ${\bf read}$ing about dogs and ${\bf listen}$ing to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP are inherently cross-modal, mapping different modalities to the same representation space. Specifically, we propose a simple cross-modal adaptation approach that learns from few-shot examples spanning different modalities. By repurposing class names as additional one-shot training samples, we achieve state-of-the-art (SOTA) results with an embarrassingly simple linear classifier for vision-language adaptation. Furthermore, we show that our approach can benefit existing methods such as prefix tuning, adapters, and classifier ensembling. Finally, to explore modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification.
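
The core idea of repurposing class names as extra one-shot training samples can be illustrated with a minimal sketch. It is not the authors' implementation: synthetic vectors stand in for CLIP image and text embeddings, and the helper names (`make_toy_features`, `train_cross_modal_linear`), the plain gradient-descent optimizer, the hyperparameters, and the choice to initialize the classifier from the text features are all assumptions made for illustration. The only property it relies on from the abstract is that both modalities live in the same representation space, so a single linear classifier can be trained on their union.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length, as is common for CLIP-style embeddings."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train_cross_modal_linear(image_feats, image_labels, text_feats,
                             lr=1.0, steps=200):
    """Fit one linear classifier on image AND text features (hypothetical helper).

    Row i of text_feats is the embedding of class i's name and is treated
    as one additional training sample with label i.
    """
    num_classes = text_feats.shape[0]
    feats = np.concatenate([l2_normalize(image_feats),
                            l2_normalize(text_feats)])
    labels = np.concatenate([image_labels, np.arange(num_classes)])
    onehot = np.eye(num_classes)[labels]

    # Assumption: start from the text features, i.e. the zero-shot classifier.
    W = l2_normalize(text_feats).copy()
    losses = []
    for _ in range(steps):
        probs = softmax(feats @ W.T)
        picked = probs[np.arange(len(labels)), labels]
        losses.append(-np.mean(np.log(picked + 1e-12)))
        # Gradient of the mean cross-entropy with respect to W.
        grad = (probs - onehot).T @ feats / len(labels)
        W -= lr * grad
    return W, losses


def make_toy_features(num_classes=5, shots=4, dim=32, seed=0):
    """Synthetic stand-in for a shared embedding space (not real CLIP output)."""
    rng = np.random.default_rng(seed)
    prototypes = rng.normal(size=(num_classes, dim))
    text_feats = prototypes + 0.1 * rng.normal(size=(num_classes, dim))
    image_labels = np.repeat(np.arange(num_classes), shots)
    noise = 0.3 * rng.normal(size=(num_classes * shots, dim))
    image_feats = prototypes[image_labels] + noise
    return image_feats, image_labels, text_feats


image_feats, image_labels, text_feats = make_toy_features()
W, losses = train_cross_modal_linear(image_feats, image_labels, text_feats)

# Classify the (toy) images with the jointly trained weights.
preds = (l2_normalize(image_feats) @ W.T).argmax(axis=1)
train_acc = (preds == image_labels).mean()
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}, train accuracy: {train_acc:.2f}")
```

With a real model, the toy features would be replaced by embeddings from the image and text encoders, and an audio encoder mapped into the same space could contribute samples in the same way.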