We demonstrate that vision-language models (VLMs) are capable of recognizing the content of audio recordings when given corresponding spectrogram images. Specifically, we instruct VLMs to perform audio classification in a few-shot setting by prompting them to classify a spectrogram image given example spectrogram images of each class. By carefully designing the spectrogram image representation and selecting good few-shot examples, we show that GPT-4o can achieve 59.00% cross-validated accuracy on the ESC-10 environmental sound classification dataset. Moreover, we demonstrate that VLMs currently outperform the only commercially available language model with audio understanding capabilities (Gemini-1.5) on the equivalent audio classification task (59.00% vs. 49.62%), and even perform slightly better than human experts on visual spectrogram classification (73.75% vs. 72.50% on the first fold). We envision two potential use cases for these findings: (1) combining the spectrogram and language understanding capabilities of VLMs for audio caption augmentation, and (2) posing visual spectrogram classification as a challenge task for VLMs.
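As a rough illustration of the few-shot prompting setup described above, the sketch below renders a log-mel spectrogram image for each audio clip and sends labelled example images together with an unlabelled query image to GPT-4o in a single prompt. The file names, class labels, prompt wording, and spectrogram parameters (e.g., 128 mel bands) are placeholders rather than the paper's actual configuration; the sketch assumes the librosa, matplotlib, numpy, and openai Python packages.

```python
import base64
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
from openai import OpenAI

def audio_to_spectrogram_png(audio_path, out_path):
    """Render a log-mel spectrogram of an audio file as a PNG image."""
    y, sr = librosa.load(audio_path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)  # placeholder settings
    mel_db = librosa.power_to_db(mel, ref=np.max)
    fig, ax = plt.subplots(figsize=(4, 3))
    librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel", ax=ax)
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)

def image_part(png_path):
    """Encode a PNG file as an inline image content part for the chat API."""
    with open(png_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

# Hypothetical few-shot examples: one labelled spectrogram image per class.
few_shot = [("dog", "dog_example.png"), ("rain", "rain_example.png")]
query_png = "query.png"  # spectrogram of the clip to classify

content = [{"type": "text",
            "text": "Each labelled image is a log-mel spectrogram of one sound class. "
                    "Classify the final, unlabelled spectrogram into one of these classes."}]
for label, png in few_shot:
    content.append({"type": "text", "text": f"Class: {label}"})
    content.append(image_part(png))
content.append({"type": "text", "text": "Unlabelled spectrogram:"})
content.append(image_part(query_png))

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```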