Audio-to-Image Bird Species Retrieval without Audio-Image Pairs via Text Distillation

Audio-to-image retrieval offers an interpretable alternative to audio-only classification for bioacoustic species recognition, but learning aligned audio-image representations is difficult because paired audio-image data are scarce. We propose a simple, data-efficient approach that enables audio-to-image retrieval without any audio-image supervision. The method uses text as a semantic intermediary: we distill the text embedding space of a pretrained image-text model (BioCLIP-2), which encodes rich visual and taxonomic structure, into a pretrained audio-text model (BioLingual) by fine-tuning its audio encoder with a contrastive objective. This distillation transfers visually grounded semantics into the audio representation, so that audio and image embeddings become aligned even though no images are used during training. We evaluate the resulting model on multiple bioacoustic benchmarks. The distilled audio encoder preserves the discriminative power of the audio representation while substantially improving audio-text alignment on both focal-recording and soundscape datasets. Most importantly, on the SSW60 benchmark, the proposed approach achieves strong audio-to-image retrieval performance, exceeding baselines based on zero-shot model combinations or on learned mappings between text embeddings, despite never training on paired audio-image data. These results demonstrate that indirect semantic transfer through text is sufficient to induce meaningful audio-image alignment, providing a practical solution for visually grounded species recognition in data-scarce bioacoustic settings.
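As a rough illustration of the distillation step described above, the following minimal NumPy sketch computes a contrastive loss between a batch of audio embeddings and the frozen teacher text embeddings of the matching species labels, then retrieves images by cosine similarity in the teacher's shared space. It is a sketch under stated assumptions, not the authors' implementation: the symmetric InfoNCE form, the temperature of 0.07, the function names, and the assumption that audio embeddings have already been projected to the teacher's dimensionality are all illustrative choices that the abstract does not specify.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_distillation_loss(audio_emb, teacher_text_emb, temperature=0.07):
    """Symmetric InfoNCE loss (an assumed form of the contrastive objective).

    Row i of audio_emb is paired with row i of teacher_text_emb, i.e. the
    frozen image-text model's text embedding of that recording's species
    label. Every other row in the batch is treated as a negative, which
    ignores batches containing the same species twice (a simplification).
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(teacher_text_emb)
    logits = a @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(a))
    loss_audio_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)


def retrieve_images(audio_emb, image_emb, k=1):
    """Return indices of the k most cosine-similar images for each audio clip.

    No images are involved in training; retrieval relies on the audio
    embeddings having been pulled into the teacher's text space, which the
    teacher's image encoder already shares.
    """
    sims = l2_normalize(audio_emb) @ l2_normalize(image_emb).T
    return np.argsort(-sims, axis=1)[:, :k]
```

In an actual training loop the loss would be minimized with respect to the audio encoder's parameters only, with the teacher's text encoder kept frozen; an autodiff framework would replace the plain NumPy arithmetic used here.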
