Paralinguistic traits such as cognitive load and emotion are increasingly recognized as pivotal topics in speech recognition research, and are often studied through specialized datasets such as CLSE and IEMOCAP. However, these datasets are seldom scrutinized for text-dependency, that is, for whether a trait label can be predicted from the spoken words alone. This paper critically evaluates the prevalent assumption that machine learning models trained on such datasets genuinely learn to identify paralinguistic traits rather than merely capturing lexical features. By examining the lexical overlap in these datasets and testing the performance of machine learning models, we expose significant text-dependency in trait labeling. Our results suggest that some machine learning models, especially large pre-trained models such as HuBERT, may inadvertently focus on lexical characteristics rather than the intended paralinguistic features. The study serves as a call to action for the research community to reevaluate the reliability of existing datasets and methodologies, so that machine learning models genuinely learn what they are designed to recognize.
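As a rough illustration of what examining lexical overlap and text-dependency could look like, the following minimal sketch measures two quantities on (transcript, label) pairs: the fraction of test transcripts that also occur in training, and the accuracy of a predictor that sees only the transcript. It is not the procedure used in the paper. The function names, the normalization rule, the toy sentences, and the labels `high_load` and `low_load` are all assumptions made for the example and are not taken from CLSE or IEMOCAP.

```python
from collections import Counter, defaultdict


def normalize(text):
    # Lowercase, drop punctuation, and collapse whitespace (an assumed rule).
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def transcript_overlap(train, test):
    """Fraction of test utterances whose transcript also occurs in train."""
    seen = {normalize(text) for text, _ in train}
    return sum(normalize(text) in seen for text, _ in test) / len(test)


def text_only_accuracy(train, test):
    """Accuracy of predicting the label from the transcript alone.

    Each test transcript gets the majority label of identical training
    transcripts; unseen transcripts get the overall majority label.
    """
    by_text = defaultdict(Counter)
    overall = Counter()
    for text, label in train:
        by_text[normalize(text)][label] += 1
        overall[label] += 1
    fallback = overall.most_common(1)[0][0]

    correct = 0
    for text, label in test:
        votes = by_text.get(normalize(text))
        pred = votes.most_common(1)[0][0] if votes else fallback
        correct += pred == label
    return correct / len(test)


# Hypothetical toy data: each prompt is tied to one label by design.
train = [
    ("Count backwards from one hundred.", "high_load"),
    ("Count backwards from one hundred.", "high_load"),
    ("Count backwards by sevens.", "high_load"),
    ("Read the word on the screen.", "low_load"),
    ("Read the word on the screen.", "low_load"),
]
test = [
    ("count backwards from one hundred", "high_load"),
    ("Read the word on the screen!", "low_load"),
    ("Name the colour of the ink.", "low_load"),
]

print(f"transcript overlap: {transcript_overlap(train, test):.2f}")
print(f"text-only accuracy: {text_only_accuracy(train, test):.2f}")
```

If a text-only predictor of this kind scores well above chance, the labels are at least partly recoverable from the words, so high accuracy from an audio model on the same split does not by itself show that paralinguistic cues were learned.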