Speaker recognition is a widely used voice-based biometric technology with applications in many industries, including banking, education, recruitment, immigration, law enforcement, healthcare, and well-being. However, while dataset evaluations and audits have improved data practices in face recognition and other computer vision tasks, data practices in speaker recognition have gone largely unquestioned. Our research aims to address this gap by exploring how dataset usage has evolved over time and what implications this has for bias, fairness, and privacy in speaker recognition systems. Previous studies have demonstrated the presence of historical, representation, and measurement biases in popular speaker recognition benchmarks. In this paper, we present a longitudinal study of speaker recognition datasets used for training and evaluation from 2012 to 2021. We survey close to 700 papers to investigate community adoption of datasets and changes in usage over a crucial period in which speaker recognition approaches transitioned to the widespread adoption of deep neural networks. Our study identifies the most commonly used datasets in the field, examines their usage patterns, and assesses the attributes that affect bias, fairness, and other ethical concerns. Our findings suggest areas for further research on the ethics and fairness of speaker recognition technology.