We introduce a toolkit for uncovering spurious correlations between recording characteristics and target class in speech datasets. Spurious correlations may arise from heterogeneous recording conditions, a common scenario for health-related datasets. When present in both the training and test data, these correlations lead to an overestimation of system performance, a dangerous situation, especially in high-stakes applications where systems are required to satisfy minimum performance requirements. Our toolkit implements a diagnostic method based on detecting the target class using only the non-speech regions of the audio. Better-than-chance performance at this task indicates that information about the target class can be extracted from the non-speech regions, flagging the presence of spurious correlations. The toolkit is publicly available for research use.
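The diagnostic idea can be illustrated with a minimal, self-contained sketch. This is not the toolkit's implementation: the energy threshold standing in for a voice activity detector, the single noise-floor feature, the nearest-class-mean classifier, the permutation test, the 16 kHz sampling rate, and all function names (`noise_floor_db`, `loo_accuracy`, `permutation_p_value`, `synth_recording`) are assumptions chosen for illustration. The toy data simulates two classes recorded at different background noise levels.

```python
import math
import random

FRAME = 160  # samples per frame (10 ms at an assumed 16 kHz sampling rate)


def noise_floor_db(signal, threshold_db=-40.0):
    """Mean log energy (dB) of low-energy frames; a crude stand-in for a real VAD."""
    levels = []
    for start in range(0, len(signal) - FRAME + 1, FRAME):
        frame = signal[start:start + FRAME]
        energy = sum(x * x for x in frame) / FRAME
        level = 10.0 * math.log10(energy + 1e-12)
        if level < threshold_db:  # keep only frames treated as non-speech
            levels.append(level)
    if not levels:
        raise ValueError("no non-speech frames found")
    return sum(levels) / len(levels)


def loo_accuracy(features, labels):
    """Leave-one-out accuracy of a nearest-class-mean classifier on a 1-D feature."""
    correct = 0
    for i, (x, y) in enumerate(zip(features, labels)):
        means = {}
        for c in set(labels):
            vals = [f for j, (f, l) in enumerate(zip(features, labels))
                    if l == c and j != i]
            means[c] = sum(vals) / len(vals)
        pred = min(means, key=lambda c: abs(x - means[c]))
        correct += pred == y
    return correct / len(labels)


def permutation_p_value(features, labels, n_perm=200, seed=0):
    """Observed accuracy and its permutation-test p-value (add-one corrected)."""
    rng = random.Random(seed)
    observed = loo_accuracy(features, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between features and labels
        hits += loo_accuracy(features, shuffled) >= observed
    return observed, (hits + 1) / (n_perm + 1)


def synth_recording(noise_std, rng, n_frames=20):
    """Toy recording: frames alternate between a loud tone ('speech') and noise only."""
    signal = []
    for k in range(n_frames):
        for n in range(FRAME):
            tone = 0.3 * math.sin(2 * math.pi * 200 * n / 16000) if k % 2 == 0 else 0.0
            signal.append(tone + rng.gauss(0.0, noise_std))
    return signal


rng = random.Random(0)
labels = [0] * 10 + [1] * 10
# Assumption: class 1 was recorded in a noisier environment than class 0.
features = [noise_floor_db(synth_recording(0.001 if y == 0 else 0.003, rng))
            for y in labels]
accuracy, p_value = permutation_p_value(features, labels)
print(f"non-speech accuracy={accuracy:.2f}, p={p_value:.3f}")
```

In this toy setting the classifier never sees the "speech" frames, so an accuracy well above chance with a small p-value can only come from the recording conditions. On a real dataset the same pattern would flag a spurious correlation rather than prove one; a proper voice activity detector and richer features would be needed in practice.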