Recent speech modeling relies on explicit attributes such as pitch, content, and speaker identity, but these alone cannot capture the full richness of natural speech. We introduce RT-MAE, a novel masked autoencoder framework that augments supervised, attribute-based modeling with unsupervised residual trainable tokens, designed to encode the information left unexplained by the explicit labeled factors (e.g., timbre variation, noise, and emotion). Experiments show that RT-MAE improves reconstruction quality, preserving content and speaker similarity while enhancing expressivity. We further demonstrate its applicability to speech enhancement, removing noise at inference while maintaining controllability and naturalness.
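The core idea can be illustrated with a minimal sketch. All names and shapes here are hypothetical (the abstract does not specify the architecture): the decoder consumes explicit attribute embeddings plus a bank of freely learnable residual tokens, and dropping those tokens at inference is the assumed route to enhancement, since noise absorbed by them is simply not re-synthesized.

```python
# Hypothetical sketch of the RT-MAE decomposition: decoder input is the
# concatenation of explicit attribute embeddings (pitch, content, speaker)
# and learnable residual tokens that absorb what the labels do not explain.
from dataclasses import dataclass, field
import random


@dataclass
class RTMAESketch:
    n_residual_tokens: int = 4   # assumed hyperparameter, not from the paper
    dim: int = 8
    residual_tokens: list = field(default_factory=list)

    def __post_init__(self):
        # Residual tokens are free parameters, trained end-to-end with the
        # autoencoder's reconstruction objective (here: random init only).
        self.residual_tokens = [
            [random.gauss(0.0, 0.02) for _ in range(self.dim)]
            for _ in range(self.n_residual_tokens)
        ]

    def decoder_input(self, attribute_embeddings, use_residual=True):
        """Concatenate attribute embeddings with (optionally) residual tokens.

        Omitting the residual tokens at inference is the hypothesized
        enhancement mechanism: noise captured by them is not re-synthesized,
        while the explicit attributes keep the output controllable.
        """
        tokens = list(attribute_embeddings)
        if use_residual:
            tokens += self.residual_tokens
        return tokens
```

This is a structural sketch under stated assumptions, not the paper's implementation; in practice the tokens would be tensors updated by gradient descent alongside the masked-autoencoder reconstruction loss.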