Audio tokenizers are fundamental to unifying audio understanding and generation. Understanding requires high-level semantics, while generation demands both semantic and acoustic detail. Existing unified tokenizers jointly encode both in high-dimensional continuous latents, which increases the modeling burden on Diffusion Transformers (DiTs) for generation. We propose LoSATok, a low-dimensional audio tokenizer for cross-domain audio understanding and generation. Motivated by the observation that 1280-dimensional semantic encoder features are compressible, we introduce a Semantic Bottleneck (SemBo) that compresses them into 128 dimensions, regularized by a proposed time-relation loss that enforces temporal feature consistency. We further design a dual-level semantic supervision method that leverages both high- and low-dimensional semantic signals, enabling the tokenizer to jointly capture semantics and acoustic details within a compact latent space. Experiments on speech, music, and general audio show that SemBo preserves strong semantic capacity at low dimensionality, and that LoSATok retains understanding performance competitive with several semantic representations while consistently improving DiT modeling performance on speech, music, and general audio generation. These results demonstrate that LoSATok's low-dimensional representations can effectively support both audio understanding and generation. Our code is available at https://github.com/wxzyd123/LoSATok.
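To make the compression step concrete, the following is a minimal NumPy sketch of a 1280-to-128 bottleneck paired with a temporal-consistency penalty. It is not taken from the LoSATok code. The abstract does not define the time-relation loss, so the sketch assumes one plausible form: the mean squared gap between the frame-to-frame cosine-similarity matrices of the high- and low-dimensional features. The names `frame_similarity`, `time_relation_loss`, and `W`, the untrained linear projection, and the random stand-in features are all hypothetical.

```python
import numpy as np

def frame_similarity(x):
    """Cosine similarity between every pair of frames; x has shape (T, D)."""
    unit = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    return unit @ unit.T  # shape (T, T)

def time_relation_loss(high, low):
    """ASSUMED form of the loss: mean squared gap between the two
    (T, T) frame-similarity matrices. The paper's definition may differ."""
    gap = frame_similarity(high) - frame_similarity(low)
    return float(np.mean(gap ** 2))

rng = np.random.default_rng(0)
T, D_HIGH, D_LOW = 50, 1280, 128  # feature sizes from the abstract; T is arbitrary

# Random stand-in for semantic encoder features (no real encoder is used here).
high = rng.standard_normal((T, D_HIGH))

# Hypothetical bottleneck: a single untrained linear projection.
W = rng.standard_normal((D_HIGH, D_LOW)) / np.sqrt(D_HIGH)
low = high @ W  # shape (T, 128)

loss = time_relation_loss(high, low)
print(low.shape, loss)
```

Under this assumed form, the penalty depends only on how frames relate to each other over time, not on the feature dimensionality, so it can compare 1280- and 128-dimensional sequences directly. In training, `W` would be learned and this term would be one component of a larger objective.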