Audio captioning has advanced significantly in recent years, driven by the availability of large-scale audio datasets and progress in deep learning techniques. In this technical report, we present our approach to audio captioning, focusing on the use of a pretrained speech-to-text Whisper model and on pretraining with synthetic captions. We describe our training procedures and report experimental results covering variations in model size, dataset mixtures, and other hyperparameters. Our findings show how different training strategies affect the performance of the audio captioning model. Our code and trained models are publicly available on GitHub and Hugging Face Hub.