It is well known that male and female singing voices typically differ in acoustic characteristics such as timbre and pitch, but it has not been explored whether these gender-based differences lead to a performance disparity in singing voice transcription (SVT), whose targets include pitch. Such a disparity could raise fairness concerns and severely degrade the user experience of downstream SVT applications. Motivated by this, we first demonstrate that SVT systems perform better on female singers than on male singers, a disparity observed across different models and datasets. We find that differences in pitch distribution, rather than gender imbalance in the data, contribute to this disparity. To address the issue, we propose using an attribute predictor to predict gender labels and adversarially training the SVT system to enforce gender invariance in its acoustic representations. Leveraging the prior knowledge that pitch distributions may contribute to the gender bias, we further propose conditionally aligning acoustic representations between demographic groups by feeding note events to the attribute predictor. Experiments on multiple benchmark SVT datasets show that our method significantly reduces gender bias (by more than 50% in the best case) with negligible degradation of overall SVT performance, on both in-domain and out-of-domain singing data, thus offering a better fairness-utility trade-off.
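As a rough illustration of the adversarial setup described above, one common way to write such an objective is the min-max form below. This is a sketch under assumed notation, not the exact loss used in this work: the symbols $E_\theta$ (acoustic encoder), $T_\phi$ (transcription head), $A_\psi$ (attribute predictor), and the trade-off weight $\lambda$ are all hypothetical names introduced here for illustration.

```latex
% Hypothetical notation, not taken from the paper:
%   E_\theta : acoustic encoder,   T_\phi : transcription head,
%   A_\psi   : attribute (gender) predictor,
%   x : audio input,   y : note events,   g : gender label,
%   \lambda \ge 0 : assumed fairness-utility trade-off weight.
\begin{equation}
\min_{\theta,\,\phi}\;\max_{\psi}\;
\mathbb{E}_{(x,\,y,\,g)}\Big[
  \mathcal{L}_{\mathrm{SVT}}\big(T_\phi(E_\theta(x)),\,y\big)
  \;-\;\lambda\,\mathcal{L}_{\mathrm{attr}}\big(A_\psi(E_\theta(x),\,y),\,g\big)
\Big]
\end{equation}
```

In this reading, the attribute predictor is trained to lower $\mathcal{L}_{\mathrm{attr}}$, while the encoder is trained to raise it, so the acoustic representation carries less gender information. Feeding the note events $y$ to $A_\psi$ is what makes the alignment conditional: the predictor already has access to the pitch content, so the encoder is presumably pushed to remove only the gender cues not explained by the notes themselves.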