Continuous speech representations based on Variational Autoencoders (VAEs) have emerged as a promising alternative to traditional spectrogram or discrete-token features for speech generation and reconstruction. Recent work has sought to enrich the structural information in VAE latent representations by aligning them with self-supervised learning (SSL) features, with the aim of improving generation performance. However, it remains unclear whether the widely used alignment approach, based on time-axis distillation, is still optimal when a broader set of tasks is considered. To address this question, this paper systematically explores different alignment approaches and analyzes their impact on performance along three axes: reconstruction, understanding, and generation. We investigate various design choices in the distillation loss. Extensive experiments show that the joint-marginal alignment approach with adaptive weighting achieves the best overall performance while allowing a controllable balance among the tasks.