Automatic speech recognition (ASR) has become a powerful tool for improving efficiency in law enforcement scenarios. To ensure fairness across demographic groups in different acoustic environments, ASR engines must be tested on a variety of speakers in realistic settings. However, describing fairness discrepancies between models with confidence remains a challenge. Moreover, most public ASR datasets are insufficient for a satisfactory fairness evaluation. To address these limitations, we built FairLENS, a systematic fairness evaluation framework. We propose a novel and adaptable evaluation method to examine the fairness disparity between different models. We also collected a fairness evaluation dataset covering multiple scenarios and demographic dimensions. Using this framework, we conducted fairness assessments of 1 open-source and 11 commercially available state-of-the-art ASR models. Our results reveal that certain models exhibit more bias than others, and they can serve as a fairness guideline for users making informed choices when selecting ASR models for a given real-world scenario. We further explored model biases towards specific demographic groups and observed that shifts in the acoustic domain can lead to the emergence of new biases.