Audio autoencoders learn useful, compressed audio representations, but their non-linear latent spaces prevent intuitive algebraic manipulation such as mixing or scaling. We introduce a simple training methodology that induces linearity in a high-compression Consistency Autoencoder (CAE) through data augmentation alone, yielding homogeneity (equivariance to scalar gain) and additivity (the decoder preserves addition) without altering the model's architecture or loss function. When trained with our method, the CAE exhibits linear behavior in both the encoder and decoder while preserving reconstruction fidelity. We test the practical utility of the learned space on music source composition and separation via simple latent arithmetic. This work presents a straightforward technique for constructing structured latent spaces, enabling more intuitive and efficient audio processing.
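The two properties named above can be checked numerically. The following is a minimal sketch: `encode` and `decode` here are hypothetical stand-ins (a toy linear projection and its pseudoinverse), not the actual CAE, used only to illustrate what homogeneity and additivity mean for a latent space.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 32))   # toy "encoder": 32-dim audio frame -> 8-dim latent
P = np.linalg.pinv(W)              # toy "decoder": pseudoinverse of the projection

def encode(x):
    return W @ x

def decode(z):
    return P @ z

x1 = rng.standard_normal(32)
x2 = rng.standard_normal(32)
a = 0.5  # scalar gain

# Homogeneity: scaling the input by a gain scales the latent by the same gain.
homogeneity_err = np.linalg.norm(encode(a * x1) - a * encode(x1))

# Additivity: decoding a sum of latents equals the sum of the decoded signals,
# which is what makes source composition via latent arithmetic possible.
additivity_err = np.linalg.norm(
    decode(encode(x1) + encode(x2)) - (decode(encode(x1)) + decode(encode(x2)))
)

print(homogeneity_err < 1e-9, additivity_err < 1e-9)
```

For a genuinely linear encoder/decoder pair both errors vanish up to floating-point precision; the paper's contribution is that a deep, non-linear CAE trained with the proposed augmentations approximately satisfies the same identities.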