Accent conversion aims to convert the accent of source speech to a target accent while preserving the speaker's identity. This paper introduces a novel non-autoregressive framework for accent conversion that learns accent-agnostic linguistic representations and uses them to convert the accent of the source speech. Specifically, the proposed system aligns speech representations with linguistic representations obtained from Text-to-Speech (TTS) systems, which enables training the accent conversion model on non-parallel data. We further investigate the effectiveness of pretraining on native data and of different acoustic features within the proposed framework. We evaluate the approach with both subjective and objective metrics. The results highlight the benefits of the pretraining strategy and of incorporating richer semantic features, both of which significantly improve audio quality and intelligibility.