Voice-based interfaces are widely used, yet fair Wake-up Word detection across diverse speaker populations remains a critical challenge because of persistent demographic biases. This study evaluates how effectively demographics-agnostic training techniques mitigate performance disparities among speakers of different sex, age, and accent. Our experiments use the OK Aura database, and training excludes demographic labels, which are reserved for evaluation. We explore (i) data augmentation techniques to improve model generalization and (ii) knowledge distillation from pre-trained foundational speech models. The results indicate that these demographics-agnostic techniques markedly reduce demographic bias, yielding more equitable performance across speaker groups. Specifically, one of the evaluated techniques reduces Predictive Disparity by 39.94\% for sex, 83.65\% for age, and 40.48\% for accent relative to the baseline. This study highlights the effectiveness of label-agnostic methodologies in fostering fairness in Wake-up Word detection.
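To make the reported percentages concrete, the following minimal sketch shows one way a group-level disparity and its relative reduction against a baseline could be computed, with demographic labels used only at evaluation time. The abstract does not define Predictive Disparity, so the max-minus-min gap in per-group error rate used here is an assumption for illustration, and the function names `per_group_error_rates`, `disparity_gap`, and `relative_reduction` are hypothetical.

```python
from collections import defaultdict


def per_group_error_rates(labels, predictions, groups):
    """Return {group: error rate}; group labels are used for evaluation only."""
    errors = defaultdict(int)
    counts = defaultdict(int)
    for y, p, g in zip(labels, predictions, groups):
        counts[g] += 1
        errors[g] += int(y != p)
    return {g: errors[g] / counts[g] for g in counts}


def disparity_gap(rates):
    """Assumed disparity measure: gap between worst and best group error rate.

    This is a stand-in, not necessarily the paper's Predictive Disparity.
    """
    values = list(rates.values())
    return max(values) - min(values)


def relative_reduction(baseline, improved):
    """Percentage reduction of a disparity value relative to the baseline."""
    if baseline == 0:
        raise ValueError("baseline disparity is zero; reduction is undefined")
    return 100.0 * (baseline - improved) / baseline


# Toy data: 1 = wake-up word present, 0 = absent; two hypothetical groups.
labels = [1, 1, 0, 0, 1, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
baseline_preds = [1, 1, 0, 0, 0, 1, 1, 0]  # group b has two errors
improved_preds = [1, 1, 0, 0, 1, 1, 1, 0]  # group b has one error

baseline_gap = disparity_gap(per_group_error_rates(labels, baseline_preds, groups))
improved_gap = disparity_gap(per_group_error_rates(labels, improved_preds, groups))
reduction = relative_reduction(baseline_gap, improved_gap)
```

Under this assumed measure, a figure such as 83.65\% for age would mean that the gap between age groups shrank to roughly one sixth of its baseline value.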