Automatic speech recognition (ASR) systems transcribe spoken language into written text and are used in a variety of applications, including voice assistants and transcription services. However, state-of-the-art ASR systems that deliver impressive benchmark results have been observed to struggle with speakers from certain regions or demographics because of variation in their speech properties. In this work, we describe the curation of a massive speech dataset of 8740 hours, consisting of $\sim9.8$K technical lectures in English along with their transcripts, delivered by instructors representing various parts of the Indian demography. The dataset is sourced from the popular NPTEL MOOC platform. We use the curated dataset to measure the existing disparity in the performance of YouTube Automatic Captions and the OpenAI Whisper model across the diverse demographic traits of speakers in India. While disparity exists due to the gender, native region, age, and speech rate of speakers, disparity based on caste is non-existent. We also observe statistically significant disparity across the disciplines of the lectures. These results indicate the need for more inclusive and robust ASR systems and for more representative datasets to evaluate disparity in them.