This technical report describes our submission to the ICME 2025 audio encoder challenge. Our system is built on BEATs, an audio encoder trained with masked audio token prediction. We extend BEATs using 74,000 hours of data drawn from diverse speech, music, and sound corpora, and scale its architecture up to 300 million parameters. We experiment with speech-heavy and balanced pre-training mixtures to study how the domain composition of pre-training data affects final performance. Our submitted system is an ensemble of the 1.2-billion-parameter Dasheng model and two custom scaled-up BEATs models trained on the aforementioned pre-training mixtures. We also propose a simple ensembling technique that retains the strongest capabilities of the constituent models and surpasses both the challenge baseline and Dasheng 1.2B. In the spirit of open science, we publicly release our trained checkpoints on Hugging Face at https://huggingface.co/shikhar7ssu/OpenBEATs-ICME-SOUND and https://huggingface.co/shikhar7ssu/OpenBEATs-ICME.
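The report does not specify the ensembling mechanism in the abstract, but a common way to combine frozen audio encoders for downstream evaluation is to concatenate their frame-level embeddings along the feature axis. The sketch below illustrates this generic approach with NumPy; the function name `ensemble_embeddings` and the frame-alignment handling are illustrative assumptions, not the authors' actual method.

```python
import numpy as np

def ensemble_embeddings(embeddings: list) -> np.ndarray:
    """Concatenate per-model embeddings along the feature axis.

    `embeddings`: one (time, dim) array per constituent encoder.
    Encoders may emit slightly different frame counts, so we
    truncate to the shortest sequence before concatenating.
    This is an illustrative sketch, not the submission's method.
    """
    frames = min(e.shape[0] for e in embeddings)
    return np.concatenate([e[:frames] for e in embeddings], axis=-1)

# Toy example: two "encoders" producing 4-dim and 6-dim features
# over roughly 10 frames of the same clip.
a = np.random.randn(10, 4)
b = np.random.randn(11, 6)  # one extra frame, will be truncated
fused = ensemble_embeddings([a, b])
print(fused.shape)  # (10, 10)
```

Concatenation preserves each model's feature space intact, letting a downstream linear probe select whichever constituent's features are most informative per task.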