From Scarcity to Scale: A Release-Level Analysis of the Pashto Common Voice Dataset

Large, openly licensed speech datasets are essential for building automatic speech recognition (ASR) systems, yet many widely spoken languages remain underrepresented in public resources. Pashto, spoken by more than 60 million people, has historically lacked large-scale openly licensed speech data suitable for modern ASR development. This paper presents a release-level analysis of the Pashto component of the Mozilla Common Voice corpus, focusing on version 24.0 (December 2025) and contextualizing trends across major releases. We document rapid growth from 1.49 recorded hours in mid-2023 to 2,768.7 total hours in 2025, including 975.89 validated hours available for supervised ASR training. Beyond scale, we analyze validation throughput, contributor participation inequality, demographic metadata completeness, and sentence-level concentration in the validated subset. We find that participation is extremely concentrated (Gini = 0.941), age representation is strongly skewed toward young adults, and 41.97% of clips lack self-reported gender labels, limiting subgroup auditing based on metadata. At the textual level, prompt reuse is moderate: 35.88% of unique sentences account for 50% of validated clips, suggesting that structural concentration is driven primarily by uneven contributor activity rather than dominance of a small prompt set. These results provide a quantitative audit of a rapidly scaling low-resource speech corpus and highlight practical priorities for improving dataset maturity, including expanded validation capacity and broader demographic participation.
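The two concentration statistics quoted above (a Gini coefficient over per-contributor clip counts, and the share of unique sentences that accounts for half of the validated clips) can be illustrated with a minimal sketch. This is not the authors' code: the function names `gini` and `share_of_items_covering`, the toy rows, and the choice to rank sentences from most to least recorded are assumptions made for illustration. The field names `client_id` and `sentence` are assumed to mirror the columns of a Common Voice `validated.tsv` file.

```python
from collections import Counter


def gini(counts):
    """Gini coefficient of non-negative counts.

    0 means every contributor recorded the same number of clips;
    values near 1 mean a few contributors recorded almost everything.
    """
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Formula on ascending-sorted values:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with i = 1..n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def share_of_items_covering(counts, target=0.5):
    """Fraction of distinct items, taken largest first, needed to
    cover `target` of the total count."""
    xs = sorted(counts, reverse=True)
    total = sum(xs)
    if total == 0:
        return 0.0
    running = 0
    for k, x in enumerate(xs, start=1):
        running += x
        if running >= target * total:
            return k / len(xs)
    return 1.0


# Hypothetical toy rows standing in for (client_id, sentence) pairs.
rows = [
    ("a", "s1"), ("a", "s1"), ("a", "s2"), ("a", "s3"),
    ("a", "s4"), ("a", "s1"), ("b", "s2"), ("c", "s1"),
]

clips_per_contributor = Counter(client for client, _ in rows)
clips_per_sentence = Counter(sentence for _, sentence in rows)

contributor_gini = gini(clips_per_contributor.values())
sentence_share = share_of_items_covering(clips_per_sentence.values(), 0.5)

print(f"Gini over contributors: {contributor_gini:.3f}")
print(f"Share of unique sentences covering 50% of clips: {sentence_share:.2%}")
```

Under this reading, a sentence share well below 50% would indicate heavy prompt reuse, while a share close to 50% would indicate that clips are spread almost evenly across prompts. Comparing it with the contributor Gini is one way to separate contributor-driven concentration from prompt-driven concentration.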
