Latent representations are at the heart of most modern generative models. In the audio domain they are typically produced by a neural-audio-codec autoencoder. In this work we introduce SAME (Semantically-Aligned Music autoEncoder), an autoencoder for stereo music and general audio that reaches a 4096$\times$ temporal compression ratio while maintaining reconstruction quality and downstream generative performance. We achieve this by combining a transformer-based backbone with a set of semantic regularisation approaches, phase-aware reconstruction losses and improved discriminator designs. The architecture delivers substantial computational cost benefits through both its high compression ratio and its reliance on well-optimised transformer primitives. Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
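To make the 4096$\times$ temporal compression ratio concrete, the sketch below converts it into a latent frame rate and a latent sequence length. Only the ratio comes from the abstract. The 44.1 kHz sample rate, the one-minute clip, the round-up (padding) behaviour for a partial final frame, and the function names are illustrative assumptions, not details of SAME itself.

```python
import math

# Temporal compression ratio stated in the abstract.
COMPRESSION_RATIO = 4096


def latent_frame_rate(sample_rate_hz: float, ratio: int = COMPRESSION_RATIO) -> float:
    """Latent frames per second of audio, for a given input sample rate."""
    return sample_rate_hz / ratio


def num_latent_frames(duration_s: float, sample_rate_hz: int,
                      ratio: int = COMPRESSION_RATIO) -> int:
    """Latent sequence length for a clip.

    Assumption: a partial final block of samples is padded to a full
    latent frame, hence the ceiling. The actual model may differ.
    """
    num_samples = round(duration_s * sample_rate_hz)
    return math.ceil(num_samples / ratio)


if __name__ == "__main__":
    sr = 44_100  # assumed sample rate; the abstract does not state one
    print(f"latent rate: {latent_frame_rate(sr):.2f} frames/s")
    print(f"60 s clip: {num_latent_frames(60.0, sr)} latent frames")
```

Under these assumptions, one second of audio maps to roughly eleven latent frames, so a downstream generative model operates on sequences of a few hundred steps per minute of audio rather than millions of samples.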