Accent normalization (AN) systems often struggle with unnatural outputs and undesired content distortion, stemming from both suboptimal training data and rigid duration modeling. In this paper, we propose a "source-synthesis" methodology for training data construction. By synthesizing the source L2 speech and using authentic native speech as the training target, our approach avoids training the model to reproduce TTS artifacts and, crucially, requires no real L2 data during training. Alongside this data strategy, we introduce CosyAccent, a non-autoregressive model that resolves the trade-off between prosodic naturalness and duration control: it models rhythm implicitly for flexibility while offering explicit control over total output duration. Experiments show that, despite being trained without any real L2 speech, CosyAccent achieves significantly improved content preservation and superior naturalness compared to strong baselines trained on real-world data.