Paralinguistic speech processing underpins tasks such as sentiment analysis and the analysis of neurocognitive disorders. The Transformer has recently achieved remarkable success in natural language processing and has been adapted to speech. However, previous applications of the Transformer to speech have not incorporated the properties of speech, leaving its full potential unexplored. In this paper, we consider the characteristics of speech and propose a general structure-based framework, called SpeechFormer++, for paralinguistic speech processing. More concretely, following the component relationship in the speech signal, we design a unit encoder that efficiently models intra- and inter-unit information, where the units are frames, phones, and words. Following the hierarchical relationship among these units, we use merging blocks to generate features at different granularities, consistent with the structural pattern of the speech signal. Moreover, a word encoder is introduced to integrate word-grained features into each unit encoder, which balances fine-grained and coarse-grained information. SpeechFormer++ is evaluated on speech emotion recognition (IEMOCAP and MELD), depression classification (DAIC-WOZ), and Alzheimer's disease detection (Pitt). The results show that SpeechFormer++ outperforms the standard Transformer while greatly reducing the computational cost. Furthermore, it delivers superior results compared to state-of-the-art approaches.
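To make the hierarchy concrete, the following is a minimal sketch of one plausible reading of two ideas in the abstract: merging fine-grained tokens into coarser ones, and restricting attention to a local neighbourhood to reduce cost. It is not the authors' implementation. The function names `merge_tokens` and `attention_pairs`, the use of average pooling as the merging operation, and all window and merge sizes are assumptions chosen for illustration.

```python
from typing import List, Optional

Vector = List[float]


def merge_tokens(tokens: List[Vector], merge_size: int) -> List[Vector]:
    """Average every `merge_size` consecutive tokens into one coarser token.

    Assumption: average pooling stands in for the merging block; a trailing
    partial group is averaged over its actual length.
    """
    if merge_size < 1:
        raise ValueError("merge_size must be at least 1")
    merged: List[Vector] = []
    for start in range(0, len(tokens), merge_size):
        group = tokens[start:start + merge_size]
        dim = len(group[0])
        merged.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return merged


def attention_pairs(num_tokens: int, window: Optional[int] = None) -> int:
    """Count the query-key pairs that attention has to score.

    With window=None every token attends to every token (n * n pairs).
    Otherwise each token attends only to neighbours within window // 2
    positions on either side, clipped at the sequence boundaries.
    """
    if window is None:
        return num_tokens * num_tokens
    half = window // 2
    total = 0
    for i in range(num_tokens):
        lo = max(0, i - half)
        hi = min(num_tokens - 1, i + half)
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    # Hypothetical sizes: 8 frame-level tokens of dimension 2.
    frames = [[float(i), float(i) + 0.5] for i in range(8)]
    phones = merge_tokens(frames, merge_size=2)  # coarser, phone-like tokens
    words = merge_tokens(phones, merge_size=2)   # coarser still, word-like tokens
    print(len(frames), len(phones), len(words))
    print(attention_pairs(len(frames)), attention_pairs(len(frames), window=3))
```

Under these assumptions, each merging step shortens the sequence, and local attention scores far fewer pairs than full attention, which is one intuition for how a hierarchical design could reduce computation relative to a standard Transformer.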