Discrete Speech Representation Tokens (DSRTs) have become a foundational component of speech generation. While prior work has extensively studied the phonetic and speaker information carried by DSRTs, how accent information is encoded in them remains largely unexplored. In this paper, we present the first systematic investigation of accent information in DSRTs. We propose a unified evaluation framework that measures both the accessibility of accent information, via a novel Accent ABX task, and its recoverability, via cross-accent Voice Conversion (VC) resynthesis. Using this framework, we analyse DSRTs derived from a variety of speech encoders. Our results reveal that accent information is substantially reduced when the encoder is fine-tuned with ASR supervision, but cannot be effectively disentangled from phonetic and speaker information through naive codebook size reduction. Based on these findings, we propose new content-only and content-accent DSRTs that significantly outperform existing designs in controllable accent generation. Our work highlights the importance of accent-aware evaluation and provides practical guidance for designing DSRTs for accent-controlled speech generation.
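The Accent ABX task mentioned in the abstract follows the standard ABX discrimination setup: given a probe X and two references A and B, where A shares X's accent and B does not, a trial is correct when X's representation is closer to A's than to B's. The abstract does not specify the distance measure or triplet construction; the sketch below is a minimal illustration under the assumption of token-level edit distance over DSRT sequences (the `edit_distance` choice and the triplet format are illustrative assumptions, not the paper's actual protocol).

```python
from typing import List, Tuple

def edit_distance(a: List[int], b: List[int]) -> int:
    # Single-row Levenshtein distance over discrete token IDs.
    # NOTE: an assumed stand-in for whatever sequence distance the paper uses.
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

def abx_accuracy(triplets: List[Tuple[List[int], List[int], List[int]]]) -> float:
    # Each triplet is (A, B, X): A shares X's accent, B carries a different one.
    # A trial counts as correct when X's token sequence is closer to A than to B.
    correct = sum(edit_distance(x, a) < edit_distance(x, b) for a, b, x in triplets)
    return correct / len(triplets)
```

High ABX accuracy indicates that accent information is linearly accessible in the token sequences themselves; chance-level accuracy (0.5) would suggest the tokenisation has discarded it.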