Continuous audio autoencoders reconstruct waveforms well but often produce latents whose structure is too weak for understanding tasks, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch complicates the design of a single audio tokenizer that must support both understanding and generation. We adapt continuous autoencoder latents to this setting with two components: a noise-regularized autoencoder bottleneck and a latent-side representation encoder. The bottleneck uses channel normalization and stochastic perturbation instead of KL-based variational training, yielding scale-controlled continuous latents for reconstruction and autoregressive generation. The representation encoder is trained on frozen autoencoder latents with RQ-MTP and frozen-LLM supervision. The resulting tokenizer provides high-dimensional representations for understanding while preserving normalized continuous latents as generation targets.
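As an illustration of the bottleneck described above, the following is a minimal sketch of channel normalization followed by stochastic perturbation. The abstract does not specify the exact form of either step, so several choices here are assumptions made only for illustration: latents are laid out as (batch, channels, frames), normalization is applied per frame across the channel axis, the perturbation is additive Gaussian noise, and the names `channel_normalize`, `noisy_bottleneck`, and `noise_std` are hypothetical.

```python
import numpy as np


def channel_normalize(z, eps=1e-6):
    """Normalize each frame's latent vector across the channel axis.

    Assumption: z has shape (batch, channels, frames), and "channel
    normalization" means zero mean and unit variance over axis 1.
    """
    mean = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)


def noisy_bottleneck(z, noise_std=0.1, training=True, rng=None):
    """Scale-controlled bottleneck with a stochastic perturbation.

    Assumption: the perturbation is additive Gaussian noise with a fixed
    standard deviation (noise_std is an illustrative value, not taken from
    the paper). No KL term is involved; the latent scale is fixed by the
    normalization, and the noise is applied only during training.
    """
    z_norm = channel_normalize(z)
    if not training:
        return z_norm
    rng = np.random.default_rng() if rng is None else rng
    return z_norm + noise_std * rng.standard_normal(z_norm.shape)
```

Under these assumptions, the decoder would see perturbed latents during training and clean normalized latents at inference, and the same normalized latents would serve as targets for autoregressive generation.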