Instruction-guided text-to-speech (TTS) research has matured to the point where high-quality speech can be generated on demand, yet two coupled biases continue to degrade perceived quality: accent bias, where models default to dominant phonetic patterns, and linguistic bias, a misalignment with dialect-specific lexical or cultural information. These biases are interdependent: authentic accent generation requires both accent fidelity and correctly localized text. We present CLARITY (Contextual Linguistic Adaptation and Retrieval for Inclusive TTS sYnthesis), a backbone-agnostic framework that addresses both biases through dual-signal optimization. First, we apply contextual linguistic adaptation to localize the input text to the target dialect. Second, we propose retrieval-augmented accent prompting (RAAP) to provide accent-consistent speech prompts. We evaluate CLARITY on twelve English accent varieties through both subjective and objective analysis. Results show that CLARITY improves accent accuracy and fairness while delivering higher perceptual quality\footnote{Code and audio samples are available at https://github.com/ICT-SIT/CLARITY.}.