Wearable emotion recognition based on peripheral physiological signals has recently drawn considerable attention because it is less invasive than alternative approaches and is applicable in real-life scenarios. However, effectively fusing multimodal data remains a challenging problem, and traditional fully supervised approaches tend to overfit when labeled data is limited. To address these issues, we propose a novel self-supervised learning (SSL) framework for wearable emotion recognition. Efficient multimodal fusion is realized with temporal convolution-based modality-specific encoders and a transformer-based shared encoder, which together capture both intra-modal and inter-modal correlations. A large amount of unlabeled data is labeled automatically by applying five signal transforms, and the proposed SSL model is pre-trained with signal transformation recognition as the pretext task, allowing it to extract generalized multimodal representations for emotion-related downstream tasks. For evaluation, the proposed SSL model was first pre-trained on a large-scale self-collected physiological dataset, and the resulting encoder was then either frozen or fine-tuned on three public supervised emotion recognition datasets. Our SSL-based method achieved state-of-the-art results on various emotion classification tasks, and it proved more accurate and more robust than fully supervised methods in low-data regimes.
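To illustrate the pretext task described in the abstract, the following is a minimal sketch of how unlabeled signal windows could be labeled automatically by the transform applied to them. The abstract does not name the five transforms, so the five used here (noise addition, scaling, negation, time flipping, segment permutation) are assumptions chosen for illustration, as are all function names, parameter values, and the single-channel window layout.

```python
import numpy as np

# Five illustrative transforms. These are ASSUMPTIONS: the abstract states
# that five signal transforms are used but does not say which ones.
def add_noise(x, rng, sigma=0.05):
    # Add zero-mean Gaussian noise to every sample.
    return x + rng.normal(0.0, sigma, size=x.shape)

def scale(x, rng, factor=1.5):
    # Multiply the amplitude by a constant factor (rng unused).
    return x * factor

def negate(x, rng):
    # Flip the sign of the signal (rng unused).
    return -x

def time_flip(x, rng):
    # Reverse the window along the time axis (rng unused).
    return x[::-1].copy()

def permute(x, rng, n_segments=4):
    # Split the window into segments and shuffle their order.
    segments = np.array_split(x, n_segments)
    order = rng.permutation(n_segments)
    return np.concatenate([segments[i] for i in order])

TRANSFORMS = [add_noise, scale, negate, time_flip, permute]

def make_pretext_dataset(windows, seed=0):
    """Apply every transform to every 1-D window.

    The pretext label of each output is the index of the transform that
    produced it, so no human annotation is required.
    """
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for w in windows:
        for label, transform in enumerate(TRANSFORMS):
            xs.append(transform(w, rng))
            ys.append(label)
    return np.stack(xs), np.array(ys)

# Usage: three hypothetical unlabeled windows of 64 samples each.
windows = np.random.default_rng(1).normal(size=(3, 64))
X, y = make_pretext_dataset(windows)
```

A classifier trained to predict `y` from `X` would be solving a signal transformation recognition task of the kind the abstract describes. The multimodal encoders themselves, and whether the untransformed signal receives its own class, are outside the scope of this sketch.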