Discrete audio representation, also known as audio tokenization, has seen renewed interest driven by its potential to facilitate the application of text language modeling approaches in the audio domain. To this end, various compression-based and representation-learning-based tokenization schemes have been proposed. However, there has been limited investigation into how compression-based audio tokens compare with well-established mel-spectrogram features across speaker- and speech-related tasks. In this paper, we evaluate compression-based audio tokens on three tasks: Speaker Verification, Diarization, and (Multi-lingual) Speech Recognition. Our findings indicate that (i) models trained on audio tokens perform competitively, on average within $1\%$ of mel-spectrogram features on all tasks considered, but do not yet surpass them; (ii) these models are robust to out-of-domain narrowband data, particularly in speaker tasks; (iii) audio tokens allow up to 20x compression relative to mel-spectrogram features with minimal loss of performance in speech- and speaker-related tasks, which is crucial for low bit-rate applications; and (iv) the examined Residual Vector Quantization (RVQ) based audio tokenizer exhibits a low-pass frequency response characteristic, offering a plausible explanation for the observed results and providing insight for future tokenizer designs.
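For readers unfamiliar with Residual Vector Quantization, the following is a minimal sketch of the general idea, not the tokenizer evaluated in this work. Each stage quantizes the residual left by the previous stage, so one feature frame becomes a short tuple of integer tokens. The function names `rvq_encode` and `rvq_decode`, the random (untrained) codebooks, and the sizes used are all illustrative assumptions; a real tokenizer learns its codebooks jointly with an encoder and decoder.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Quantize frames x of shape (N, D) with a list of (K, D) codebooks.

    Returns integer tokens of shape (N, num_stages).
    """
    residual = np.asarray(x, dtype=np.float64).copy()
    tokens = []
    for cb in codebooks:
        # Squared Euclidean distance from every residual to every codeword.
        dist = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dist.argmin(axis=1)
        tokens.append(idx)
        # The next stage only sees what this stage failed to explain.
        residual = residual - cb[idx]
    return np.stack(tokens, axis=1)


def rvq_decode(tokens, codebooks):
    """Reconstruct frames by summing the selected codeword of each stage."""
    return sum(cb[tokens[:, i]] for i, cb in enumerate(codebooks))


# Toy setup (assumed values): 3 stages, 16 codewords each, 8-dim frames.
# Codebooks are random with shrinking scale; each includes the zero vector
# so that adding a stage can never increase the reconstruction error.
rng = np.random.default_rng(0)
D, K, S = 8, 16, 3
codebooks = [
    np.vstack([np.zeros((1, D)), rng.normal(scale=0.5 ** s, size=(K - 1, D))])
    for s in range(S)
]
frames = rng.normal(size=(100, D))
tokens = rvq_encode(frames, codebooks)
reconstruction = rvq_decode(tokens, codebooks)
```

Decoding with only the first few stages, for example `rvq_decode(tokens[:, :1], codebooks[:1])`, gives a coarser reconstruction at a lower bit rate, which is the kind of trade-off that makes this family of tokenizers relevant to low bit-rate applications.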