Self-supervised learning (SSL) yields powerful, context-rich representations for speech emotion recognition (SER), yet aggregating these representations into holistic descriptors remains a bottleneck. Conventional first-order aggregation implicitly assumes feature independence, which overlooks the latent Riemannian geometry and discards higher-order relationships essential to the representational power of the backbone. To address this problem, this paper proposes a novel Second-Order Correlation (SOC) layer. Instead of treating features in isolation, SOC models feature correlations as covariance descriptors to capture synergistic co-occurrence patterns, which serve as discriminative signatures for robust emotion recognition. By mapping these descriptors from the Riemannian manifold to a Euclidean tangent space through Log-Euclidean mapping (LEM), the proposed method preserves geometric integrity while enabling direct linear discriminative learning. Extensive experiments on the ESD and RAVDESS datasets demonstrate that SOC recovers discriminative information lost in first-order pooling and effectively aggregates high-dimensional SSL features.
翻译:自监督学习(SSL)为语音情感识别(SER)提供了强大且富含上下文的表示,但将这些表示聚合为整体描述符仍是一个瓶颈。传统的一阶聚合隐式假设特征独立,忽略了潜在的黎曼几何结构,并丢弃了对骨干网络表示能力至关重要的高阶关系。为解决此问题,本文提出了一种新颖的二阶相关性(SOC)层。SOC不孤立地处理特征,而是将特征相关性建模为协方差描述符,以捕获协同共现模式,这些模式作为鲁棒情感识别的判别性标识。通过对数欧几里得映射(LEM)将这些描述符从黎曼流形映射至欧几里得切空间,所提方法在保持几何完整性的同时,实现了直接的线性判别学习。在ESD和RAVDESS数据集上的大量实验表明,SOC能够恢复在一阶池化中丢失的判别信息,并有效聚合高维SSL特征。