Current bias evaluations of Instruction Text-to-Speech (ITTS) systems often rely on univariate testing, which varies one social cue at a time and overlooks the compositional structure of social cues. In this work, we investigate gender bias by modeling prompts as combinations of Social Status, Career stereotypes, and Persona descriptors. Analyzing open-source ITTS models, we uncover systematic interaction effects in which social dimensions modulate one another, creating bias patterns that univariate baselines miss. Our findings indicate that these biases are not merely surface-level artifacts: they are strongly associated with the semantic priors of pre-trained text encoders and with skewed distributions in the training data. We further show that generic diversity prompting is insufficient to override these entrenched patterns, which underscores the need for compositional analysis when diagnosing latent risks in generative speech.