Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work in practical settings (16/24-bit) and can compete with existing codecs. We benchmark LM-based compression on full-fidelity audio across diverse domains (music, speech, bioacoustics), sampling rates (16kHz-48kHz), and bit depths (8, 16, 24-bit). Standard sample-level tokenization becomes intractable at higher bit depths because the vocabulary grows exponentially (65K entries for 16-bit; 16.7M for 24-bit). We propose Trilobyte, a byte-level tokenization scheme for full-resolution audio that improves vocabulary scaling from $O(2^{b})$ to $O(1)$, enabling the first tractable 24-bit LM-based lossless compression. While LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit, we observe that the compression gains become more modest as bit depth increases beyond 8-bit.
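To make the vocabulary-scaling argument concrete, the sketch below contrasts sample-level tokenization (one token per $b$-bit sample, so $2^{b}$ possible symbols) with byte-level tokenization (each sample split into its constituent bytes, so the vocabulary is fixed at 256). This is a minimal illustration, not Trilobyte's actual implementation; the big-endian byte order and the helper names are assumptions.

```python
def sample_level_vocab(bit_depth: int) -> int:
    """Vocabulary size when each b-bit sample is a single token: O(2^b)."""
    return 2 ** bit_depth


def byte_level_tokenize(samples, bit_depth: int):
    """Split each unsigned b-bit sample into big-endian bytes.

    Every token falls in [0, 255], so the vocabulary stays at 256
    regardless of bit depth: O(1) scaling. (Illustrative only; the
    paper's Trilobyte scheme may order or group bytes differently.)
    """
    n_bytes = bit_depth // 8
    tokens = []
    for s in samples:
        for shift in range(8 * (n_bytes - 1), -1, -8):
            tokens.append((s >> shift) & 0xFF)
    return tokens


# Sample-level vocabularies quickly become intractable for an LM softmax:
print(sample_level_vocab(16))  # 65536  (~65K)
print(sample_level_vocab(24))  # 16777216  (~16.7M)

# A single 24-bit sample becomes three 8-bit tokens:
print(byte_level_tokenize([0x123456], 24))  # [18, 52, 86]
```

The trade-off is sequence length: a 24-bit sample costs three tokens instead of one, so the model must capture longer-range dependencies to achieve the same compression.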