Audio autoencoders learn useful, compressed audio representations, but their non-linear latent spaces prevent intuitive algebraic manipulation such as mixing or scaling. We introduce a simple training methodology that induces linearity in a high-compression Consistency Autoencoder (CAE) through data augmentation alone, yielding homogeneity (equivariance to scalar gain) and additivity (the decoder preserves addition) without altering the model's architecture or loss function. When trained with our method, the CAE exhibits linear behavior in both the encoder and decoder while preserving reconstruction fidelity. We demonstrate the practical utility of the learned space on music source composition and separation via simple latent arithmetic. This work presents a straightforward technique for constructing structured latent spaces, enabling more intuitive and efficient audio processing.
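The two properties named above can be made concrete as augmentation-style penalties. The sketch below is a minimal illustration, not the paper's implementation: `encode`/`decode` are hypothetical linear stand-ins for the CAE, and `linearity_losses` shows the homogeneity and additivity targets one would drive toward zero during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear stand-ins for the CAE encoder/decoder
# (the paper's actual model is a neural Consistency Autoencoder).
W = rng.standard_normal((8, 32)) * 0.1

def encode(x):
    """Map a 32-sample frame to an 8-dim latent."""
    return W @ x

def decode(z):
    """Map an 8-dim latent back to a 32-sample frame."""
    return W.T @ z

def linearity_losses(x1, x2, gain):
    """Penalties that, when minimized during training, encourage
    the two properties named in the abstract."""
    # Homogeneity: encoding a gain-scaled input should scale the latent.
    hom = np.mean((encode(gain * x1) - gain * encode(x1)) ** 2)
    # Additivity: decoding a sum of latents should equal the sum
    # of the individually decoded signals (source composition).
    z1, z2 = encode(x1), encode(x2)
    add = np.mean((decode(z1 + z2) - (decode(z1) + decode(z2))) ** 2)
    return hom, add

x1, x2 = rng.standard_normal(32), rng.standard_normal(32)
hom, add = linearity_losses(x1, x2, gain=0.5)
```

Because the stand-in maps here are exactly linear, both losses are numerically zero; for a trained non-linear CAE they would be small but non-zero, and latent arithmetic such as `decode(encode(x1) + encode(x2))` would approximate the mixture of the two sources.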