This paper investigates three crucial yet underexplored aspects of the generalization capabilities of neural audio codecs (NACs): (i) whether NACs can generalize to languages unseen during pre-training, (ii) whether speech-only pre-trained NACs can effectively generalize to non-speech audio such as environmental sounds, music, and animal vocalizations, and (iii) whether incorporating non-speech data during pre-training can improve performance on both speech and non-speech tasks. Existing studies typically compare off-the-shelf NACs, which limits insight because their implementations differ. In this work, we train NACs from scratch under strictly controlled configurations with carefully curated pre-training data to enable fair comparisons. We comprehensively evaluate NAC performance on both signal reconstruction quality and downstream applications using 11 metrics. Our results show that NACs can generalize to languages unseen during pre-training, that speech-only pre-trained NACs exhibit degraded performance on non-speech tasks, and that incorporating non-speech data during pre-training improves performance on non-speech tasks while maintaining comparable performance on speech tasks.