This report presents the AISTAT team's submission to the language-based audio retrieval task in DCASE 2025 Task 6. Our system employs a dual-encoder architecture in which the audio and text modalities are encoded separately and their representations are aligned using contrastive learning. Drawing on methods from the previous year's challenge, we implemented a distillation approach and used large language models (LLMs) for data augmentation, including back-translation and LLM mix. Additionally, we used clustering to define an auxiliary classification task for further fine-tuning. Our best single system achieved a mAP@16 of 46.62, while an ensemble of four systems reached a mAP@16 of 48.83 on the Clotho development-testing split.
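The contrastive alignment of the two encoders can be illustrated with a minimal sketch. The report does not state the exact loss, temperature, or embedding size here, so the symmetric InfoNCE-style form below, the `temperature=0.07` default, and the function names are all assumptions made for illustration rather than the submission's implementation. The encoder outputs are stood in for by plain lists of floats, with row `i` of the audio batch paired with row `i` of the text batch.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Hypothetical symmetric contrastive loss for paired audio/text embeddings.

    audio_emb[i] and text_emb[i] are assumed to describe the same clip;
    every other pairing in the batch is treated as a negative.
    """
    if len(audio_emb) != len(text_emb):
        raise ValueError("audio and text batches must have the same size")
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(a)

    # Temperature-scaled cosine similarity between every audio and every text.
    sim = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Audio-to-text direction: each audio should pick out its own caption.
    a2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n

    # Text-to-audio direction: the same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2a = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n

    return 0.5 * (a2t + t2a)


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    matched_text = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_text = [[0.0, 1.0], [1.0, 0.0]]
    print("matched :", symmetric_contrastive_loss(audio, matched_text))
    print("shuffled:", symmetric_contrastive_loss(audio, shuffled_text))
```

In this toy batch the matched pairing yields a much smaller loss than the shuffled one, which is the behaviour that pushes paired audio and text embeddings together during training. A practical system would compute the same quantity on batched tensors from the two encoders in a deep-learning framework.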