Computational paralinguistics (ComParal) aims to develop algorithms and models that automatically detect, analyze, and interpret non-verbal information in speech communication, e.g., emotion, health state, age, and gender. Despite its rapid progress, the field relies heavily on models elaborately designed for specific paralinguistic tasks. The resulting heterogeneity and diversity of ComParal models largely hinder their real-world deployment. Recently, with the advent of acoustic foundation models enabled by self-supervised learning, developing more generic models that can efficiently perceive a plethora of paralinguistic information has become an active topic in speech processing. However, the field lacks a unified evaluation framework for fair and consistent performance comparison. To bridge this gap, we present a large-scale benchmark, ParaLBench, which standardizes the evaluation of diverse paralinguistic tasks, including critical aspects of affective computing such as emotion recognition and emotion dimension prediction, across different acoustic foundation models. The benchmark comprises ten datasets covering thirteen distinct paralinguistic tasks that span short-, medium-, and long-term characteristics. Each task is evaluated on 14 acoustic foundation models under a unified evaluation framework, which allows for an unbiased methodological comparison and offers a grounded reference for the ComParal community. Based on the insights gained from ParaLBench, we also point out potential research directions, notably cross-corpus generalizability, to propel future ComParal research. The code associated with this study will be made available to foster the transparency and replicability of this work for succeeding researchers.