This paper presents the Voice Timbre Attribute Detection (vTAD) systems developed by the Digital Signal Processing & Speech Technology Laboratory (DSP&STL) of the Department of Electronic Engineering (EE) at The Chinese University of Hong Kong (CUHK) for the 20th National Conference on Human-Computer Speech Communication (NCMMSC 2025) vTAD Challenge. The proposed systems leverage WavLM-Large embeddings with attentive statistical pooling (ASTP) to extract robust speaker representations, followed by one of two Diff-Net variants, namely a Feed-Forward Neural Network (FFN) and a Squeeze-and-Excitation-enhanced Residual FFN (SE-ResFFN), to compare timbre attribute intensities between utterance pairs. Experimental results demonstrate that the WavLM-Large+FFN system generalises better to unseen speakers, achieving 77.96% accuracy and a 21.79% equal error rate (EER), while the WavLM-Large+SE-ResFFN model excels in the 'Seen' setting with 94.42% accuracy and a 5.49% EER. These findings highlight a trade-off between model complexity and generalisation, and underscore the importance of architectural choices in fine-grained speaker modelling. Our analysis also reveals the impact of speaker identity, annotation subjectivity, and data imbalance on system performance, pointing to future directions for improving robustness and fairness in timbre attribute detection.
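To make the pipeline concrete, the following is a minimal PyTorch sketch of the two stages named above: attentive statistical pooling over frame-level embeddings (such as those produced by WavLM-Large), followed by an FFN-style Diff-Net head that compares two utterance-level embeddings. The pooling formulation, the difference-based pairing, and all layer sizes are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class AttentiveStatsPooling(nn.Module):
    """Attention-weighted mean and standard deviation over frames (an ASTP sketch)."""

    def __init__(self, dim: int, bottleneck: int = 128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.Tanh(),
            nn.Linear(bottleneck, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) frame-level embeddings, e.g. from WavLM-Large.
        w = torch.softmax(self.attn(x), dim=1)                   # (batch, frames, 1)
        mu = (w * x).sum(dim=1)                                  # weighted mean
        var = (w * (x - mu.unsqueeze(1)) ** 2).sum(dim=1)        # weighted variance
        # Concatenate mean and std into one utterance-level vector: (batch, 2*dim).
        return torch.cat([mu, var.clamp(min=1e-8).sqrt()], dim=-1)


class DiffNetFFN(nn.Module):
    """FFN Diff-Net head: predicts, per attribute, whether utterance A has the
    stronger timbre attribute intensity than utterance B (hypothetical layout)."""

    def __init__(self, emb_dim: int, n_attributes: int, hidden: int = 256):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_attributes),
        )

    def forward(self, emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
        # Assumption: the embedding difference carries the relative intensity cue.
        return torch.sigmoid(self.ffn(emb_a - emb_b))            # P(A stronger than B)


# Usage with random stand-ins for WavLM-Large frame embeddings (hidden size 1024):
pool = AttentiveStatsPooling(dim=1024)
head = DiffNetFFN(emb_dim=2048, n_attributes=1)                  # 2048 = 2 * 1024
frames_a = torch.randn(2, 300, 1024)
frames_b = torch.randn(2, 300, 1024)
scores = head(pool(frames_a), pool(frames_b))                    # (2, 1) pair scores
```

The SE-ResFFN variant reported above would replace the plain FFN head with residual blocks carrying squeeze-and-excitation gating; the pairwise comparison interface stays the same.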