Text-to-feature diffusion for audio-visual few-shot learning

Training deep learning models for video classification from audio-visual data commonly requires vast amounts of labeled training data, which are costly to collect. A challenging and underexplored, yet much cheaper, setup is few-shot learning from video data. In particular, the inherently multi-modal nature of video data, which combines sound and visual information, has not been leveraged extensively for the few-shot video classification task. Therefore, we introduce a unified audio-visual few-shot video classification benchmark on three datasets, namely VGGSound-FSL, UCF-FSL, and ActivityNet-FSL, where we adapt and compare ten methods. In addition, we propose AV-DIFF, a text-to-feature diffusion framework, which first fuses the temporal and audio-visual features via cross-modal attention and then generates multi-modal features for the novel classes. We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual (generalised) few-shot learning. Our benchmark paves the way for effective audio-visual classification when only limited labeled data is available. Code and data are available at https://github.com/ExplainableML/AVDIFF-GFSL.
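The fusion step mentioned in the abstract can be illustrated with a small sketch of cross-modal attention. This is not the AV-DIFF implementation: it assumes plain scaled dot-product attention without learned projections, audio and visual features of the same dimension, and mean pooling over time. All function names (`cross_attend`, `mean_pool`, `fuse`) are hypothetical, and the diffusion-based feature generation is not shown.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cross_attend(queries, context):
    """Scaled dot-product attention: each query vector is replaced by a
    weighted average of the context vectors (no learned projections)."""
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, c) / scale for c in context])
        attended.append(
            [sum(w * c[j] for w, c in zip(weights, context)) for j in range(len(q))]
        )
    return attended

def mean_pool(seq):
    """Average a sequence of vectors over the time axis."""
    return [sum(col) / len(seq) for col in zip(*seq)]

def fuse(audio, visual):
    """Let each modality attend to the other, pool over time, concatenate."""
    audio_ctx = cross_attend(audio, visual)   # audio steps attend to visual steps
    visual_ctx = cross_attend(visual, audio)  # visual steps attend to audio steps
    return mean_pool(audio_ctx) + mean_pool(visual_ctx)

# Toy inputs (assumed shapes): 5 audio steps and 3 visual steps, dimension 4.
random.seed(0)
DIM = 4
audio = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
visual = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
fused = fuse(audio, visual)
print(len(fused))  # 8, i.e. 2 * DIM
```

In a trained model the queries, keys, and values would normally pass through learned projection layers first; the sketch omits them to keep the attention pattern itself visible.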

Few-shot learning

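Few-shot learning (FSL) addresses the problem of improving a neural network's performance when only a small amount of data is available. One family of methods often used in FSL is meta-learning. Like ordinary neural network training, meta-learning has a training phase and a testing phase, which are called meta-training and meta-testing, respectively.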
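Meta-training and meta-testing are commonly organised into episodes, each a small N-way K-shot classification task. The sketch below shows one way to sample such an episode from a labeled dataset. It is an illustration of episodic sampling in general, not the protocol of the benchmark described above. The function name `sample_episode` and its parameters are hypothetical.

```python
import random
from collections import defaultdict

def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode.

    Returns (support, query), each a list of (sample_index, class_label).
    """
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Only classes with enough samples for support + query are eligible.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= k_shot + n_query]
    classes = rng.sample(eligible, n_way)

    support, query = [], []
    for c in classes:
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += [(i, c) for i in picked[:k_shot]]  # K labeled examples per class
        query += [(i, c) for i in picked[k_shot:]]    # held-out examples to classify
    return support, query

# Toy dataset (assumed): 10 classes with 20 samples each.
labels = [c for c in range(10) for _ in range(20)]
rng = random.Random(0)
support, query = sample_episode(labels, n_way=5, k_shot=1, n_query=3, rng=rng)
print(len(support), len(query))  # 5 15
```

During meta-training the episodes would be drawn from base classes, and during meta-testing from novel classes that were never seen in training.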
