Voice imitation aims to transform source speech to match a reference speaker's timbre and speaking style while preserving its linguistic content. A straightforward approach is to train on (source, reference, target) triplets, in which the source and target share the same content and the target matches the reference's voice characteristics, but such data is extremely scarce. Existing approaches either employ carefully designed disentanglement architectures to bypass this scarcity or leverage external systems to synthesize pseudo-parallel training data. However, the former requires intricate model design, and the latter faces a quality ceiling when synthetic speech is used as the training target. To address these limitations, we propose MimicLM, which instead uses synthetic speech as the training source and retains real recordings as the targets. This design lets the model learn directly from the distribution of real speech, breaking the synthetic quality ceiling. Building on this data construction, we incorporate interleaved text-audio modeling to guide the generation of content-accurate speech, and we apply post-training with preference alignment to mitigate the distributional mismatch inherent in training on synthetic data. Experiments demonstrate that MimicLM achieves superior voice imitation quality with a simple yet effective architecture, significantly outperforming existing methods in naturalness while maintaining competitive similarity scores across the speaker identity, accent, and emotion dimensions.
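The data construction described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Utterance` and `Triplet` types, the `build_triplets` helper, and the `synthesize` callable are hypothetical names introduced here, and the choice of a second real recording from the target's speaker as the reference is an assumption that the abstract does not state. The only property taken from the text is that the source is synthetic while the target remains a real recording with the same content.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Utterance:
    """Hypothetical container for one recording."""
    speaker: str
    text: str
    audio: Sequence[float]
    is_synthetic: bool = False


@dataclass(frozen=True)
class Triplet:
    source: Utterance     # synthetic speech, same content as the target
    reference: Utterance  # real speech carrying the voice to imitate
    target: Utterance     # real recording the model is trained to produce


def build_triplets(
    real: List[Utterance],
    synthesize: Callable[[Utterance], Utterance],
    seed: int = 0,
) -> List[Triplet]:
    """Pair each real target with a synthetic source and a real reference.

    `synthesize` stands in for any external system that re-renders the
    target's content in a different voice (an assumption of this sketch).
    """
    rng = random.Random(seed)
    by_speaker: Dict[str, List[Utterance]] = {}
    for utt in real:
        by_speaker.setdefault(utt.speaker, []).append(utt)

    triplets: List[Triplet] = []
    for target in real:
        # Assumption: the reference is another real clip by the same speaker.
        candidates = [u for u in by_speaker[target.speaker] if u is not target]
        if not candidates:
            continue  # no second recording to serve as a reference
        reference = rng.choice(candidates)
        source = synthesize(target)
        triplets.append(Triplet(source, reference, target))
    return triplets


def fake_synthesize(utt: Utterance) -> Utterance:
    """Stub for an external synthesis system: keeps the text, swaps the voice."""
    return Utterance("synthetic-voice", utt.text, [0.0] * len(utt.audio), True)


real = [
    Utterance("alice", "hello", [0.1, 0.2]),
    Utterance("alice", "good morning", [0.3]),
    Utterance("bob", "hi", [0.5]),  # skipped: bob has no second recording
]
triplets = build_triplets(real, fake_synthesize)
```

In this arrangement, any artifacts of the external system appear only on the input side, while the training target is always real speech. That is the property the abstract credits with removing the synthetic quality ceiling.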