Audio-to-Image Bird Species Retrieval without Audio-Image Pairs via Text Distillation

Audio-to-image retrieval offers an interpretable alternative to audio-only classification for bioacoustic species recognition, but learning aligned audio-image representations is difficult because paired audio-image data are scarce. We propose a simple, data-efficient approach that enables audio-to-image retrieval without any audio-image supervision. Our method uses text as a semantic intermediary: we distill the text embedding space of a pretrained image-text model (BioCLIP-2), which encodes rich visual and taxonomic structure, into a pretrained audio-text model (BioLingual) by fine-tuning its audio encoder with a contrastive objective. This distillation transfers visually grounded semantics into the audio representation, so that alignment between audio and image embeddings emerges even though no images are used during training. We evaluate the resulting model on multiple bioacoustic benchmarks. The distilled audio encoder preserves its discriminative power on audio while substantially improving audio-text alignment on both focal-recording and soundscape datasets. Most importantly, on the SSW60 benchmark, the proposed approach achieves audio-to-image retrieval performance that exceeds baselines based on zero-shot model combinations or on learned mappings between text embeddings, despite never training on paired audio-image data. These results demonstrate that indirect semantic transfer through text is sufficient to induce meaningful audio-image alignment, providing a practical solution for visually grounded species recognition in data-scarce bioacoustic settings.
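The distillation step described above can be illustrated with a minimal sketch. It assumes a one-directional InfoNCE-style loss in which each audio embedding is pulled toward the frozen teacher text embedding of its own species label and pushed away from the other labels in the batch; the abstract states only that the objective is contrastive, so the exact form, the temperature value, and the function names (`contrastive_distillation_loss`, `retrieve_image`) are illustrative assumptions rather than the authors' implementation. Embeddings are plain Python lists standing in for encoder outputs.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_distillation_loss(audio_embs, teacher_text_embs, temperature=0.07):
    """Assumed InfoNCE-style loss: audio item i should match teacher text item i.

    audio_embs: outputs of the audio encoder being fine-tuned (student).
    teacher_text_embs: frozen text embeddings from the image-text model (teacher).
    The temperature of 0.07 is a common default, not a value from the paper.
    """
    audio = [l2_normalize(a) for a in audio_embs]
    text = [l2_normalize(t) for t in teacher_text_embs]
    total = 0.0
    for i, a in enumerate(audio):
        logits = [dot(a, t) / temperature for t in text]
        # Numerically stable log-sum-exp over all candidate text embeddings.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching text embedding as the target.
        total += log_denominator - logits[i]
    return total / len(audio)


def retrieve_image(audio_emb, image_embs):
    """Return the index of the image embedding most similar to the audio embedding."""
    a = l2_normalize(audio_emb)
    scores = [dot(a, l2_normalize(img)) for img in image_embs]
    return max(range(len(scores)), key=scores.__getitem__)
```

At inference time no text is needed in this sketch: because the teacher's text and image embeddings already share a space, an audio embedding trained toward the teacher's text embeddings can be compared directly against image embeddings with `retrieve_image`.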
