Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work in practical settings (16-bit and 24-bit) and whether they can compete with existing codecs. We benchmark LM-based compression on full-fidelity audio across diverse domains (music, speech, bioacoustics), sampling rates (16 kHz to 48 kHz), and bit depths (8, 16, and 24 bits). Standard sample-level tokenization becomes intractable at higher bit depths because the vocabulary grows with the number of possible sample values (about 65K for 16-bit; 16.7M for 24-bit). We propose Trilobyte, a byte-level tokenization scheme for full-resolution audio that improves vocabulary scaling in the bit depth $b$ from $O(2^{b})$ to $O(1)$ and enables the first tractable 24-bit LM-based lossless compression. LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit, but we observe that compression gains become more modest as bit depth increases beyond 8-bit.
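To make the vocabulary-scaling argument concrete, the following is a minimal Python sketch of byte-level tokenization for integer PCM samples. It is an illustration under stated assumptions, not the Trilobyte implementation: the function names, the signed-to-unsigned offset, and the big-endian byte order are all hypothetical choices made for this example. It shows only that each sample of bit depth $b$ can be mapped to $b/8$ tokens drawn from a fixed 256-entry vocabulary and recovered exactly.

```python
from typing import List


def sample_vocab_size(bit_depth: int) -> int:
    """Vocabulary size when every distinct sample value is its own token."""
    return 2 ** bit_depth


def samples_to_byte_tokens(samples: List[int], bit_depth: int) -> List[int]:
    """Split signed PCM samples into byte tokens in the range [0, 256).

    Assumptions (illustrative only): samples are signed integers, the bit
    depth is a multiple of 8, and bytes are emitted most significant first.
    """
    if bit_depth % 8 != 0:
        raise ValueError("bit_depth must be a multiple of 8")
    n_bytes = bit_depth // 8
    offset = 1 << (bit_depth - 1)  # shifts signed values into [0, 2**bit_depth)
    tokens: List[int] = []
    for s in samples:
        if not -offset <= s < offset:
            raise ValueError(f"sample {s} does not fit in {bit_depth} bits")
        # Iterating over a bytes object yields integers in [0, 256).
        tokens.extend((s + offset).to_bytes(n_bytes, "big"))
    return tokens


def byte_tokens_to_samples(tokens: List[int], bit_depth: int) -> List[int]:
    """Invert samples_to_byte_tokens exactly (the mapping is lossless)."""
    n_bytes = bit_depth // 8
    if len(tokens) % n_bytes != 0:
        raise ValueError("token count is not a multiple of bytes per sample")
    offset = 1 << (bit_depth - 1)
    return [
        int.from_bytes(bytes(tokens[i:i + n_bytes]), "big") - offset
        for i in range(0, len(tokens), n_bytes)
    ]


if __name__ == "__main__":
    pcm_24bit = [-8388608, -1, 0, 1, 8388607]
    tokens = samples_to_byte_tokens(pcm_24bit, 24)
    print("sample-level vocabulary:", sample_vocab_size(24))
    print("byte-level vocabulary:  256")
    print("tokens:", tokens)
    print("round trip ok:", byte_tokens_to_samples(tokens, 24) == pcm_24bit)
```

In this sketch the vocabulary stays at 256 entries regardless of bit depth, at the cost of a token sequence that is $b/8$ times longer than the sample sequence; how the actual method orders or models the bytes is not specified here.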