Creating an African American-Sounding TTS: Guidelines, Technical Challenges, and Surprising Evaluations

Representations of AI agents in user interfaces and robotics are predominantly White, not only in terms of facial and skin features, but also in the synthetic voices they use. In this paper we explore some unexpected challenges in the representation of race that we encountered while developing a U.S. English Text-to-Speech (TTS) system intended to sound like an educated, professional African American woman without a regional accent. The paper starts by presenting the results of focus groups with African American IT professionals, in which guidelines and challenges for creating a representative and appropriate TTS system were discussed and gathered, followed by a discussion of some of the technical difficulties faced by the TTS system developers. We then describe two studies with U.S. English speakers in which participants were unable to attribute the correct race to the African American TTS voice, while overwhelmingly recognizing the race of a White TTS voice of similar quality. A focus group with African American IT workers not only confirmed the representativeness of the African American voice we built, but also suggested that the surprising recognition results may have been caused by the inability, or latent prejudice, of non-African Americans to associate educated, non-vernacular, professional-sounding voices with African American people.

Speech Synthesis

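Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on multiple disciplines, including artificial intelligence, psychology, acoustics, linguistics, digital signal processing, and computer science, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has progressed from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis. The quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, in car navigation systems, and in automated call-center response systems, yielding substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other media closely tied to daily life, applications of speech synthesis are extending into entertainment, language teaching, and rehabilitation therapy. Speech synthesis can be said to be influencing every aspect of people's lives.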
