Like face recognition, speaker recognition is widely used for voice-based biometric identification across a broad range of industries, including banking, education, recruitment, immigration, law enforcement, healthcare, and well-being. However, while dataset evaluations and audits have improved data practices in computer vision and face recognition, data practices in speaker recognition have gone largely unquestioned. Our research aims to address this gap by exploring how dataset usage has evolved over time and what implications this has for bias and fairness in speaker recognition systems. Previous studies have demonstrated the presence of historical, representation, and measurement biases in popular speaker recognition benchmarks. In this paper, we present a longitudinal study of speaker recognition datasets used for training and evaluation from 2012 to 2021. We survey close to 700 papers to investigate community adoption of datasets and changes in their usage over a crucial period during which speaker recognition approaches transitioned to widespread adoption of deep neural networks. Our study identifies the most commonly used datasets in the field, examines their usage patterns, and assesses the dataset attributes that affect bias, fairness, and other ethical concerns. Our findings suggest directions for further research on the ethics and fairness of speaker recognition technology.