Husformer: A Multi-Modal Transformer for Multi-Modal Human State Recognition

Human state recognition is a critical topic with pervasive and important applications in human-machine systems. Multi-modal fusion, the combination of metrics from multiple data sources, has been shown to be a sound method for improving recognition performance. However, although recent multi-modal models report promising results, they generally fail to leverage sophisticated fusion strategies that model sufficient cross-modal interactions when producing the fusion representation; instead, they rely on lengthy and inconsistent data preprocessing and feature crafting. To address this limitation, we propose Husformer, an end-to-end multi-modal transformer framework for human state recognition. Specifically, we use cross-modal transformers, which allow one modality to reinforce itself by directly attending to latent relevance revealed in other modalities, to fuse different modalities while ensuring sufficient awareness of the cross-modal interactions introduced. Subsequently, we use a self-attention transformer to further prioritize contextual information in the fusion representation. Using these two attention mechanisms enables effective and adaptive adjustment to noise and interruptions in multi-modal signals, both during the fusion process and in relation to high-level features. Extensive experiments on two human emotion corpora (DEAP and WESAD) and two cognitive workload datasets (MOCAS and CogLoad) demonstrate that Husformer outperforms both state-of-the-art multi-modal baselines and single-modality inputs by a large margin in human state recognition, especially when dealing with raw multi-modal signals. We also conduct an ablation study to show the benefit of each component of Husformer.
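
The two attention stages described above can be illustrated with a minimal NumPy sketch. This is not the Husformer implementation: it assumes single-head scaled dot-product attention with random projection weights, and it omits multi-head attention, positional encoding, residual connections, layer normalization, feed-forward blocks, and the classification head. The modality names (`eeg`, `gsr`), the sequence lengths, the feature size `d`, and the choice to concatenate the reinforced sequences along the time axis are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query_seq, context_seq, w_q, w_k, w_v):
    """Single-head scaled dot-product attention.

    query_seq:   (T_q, d) sequence that produces the queries
    context_seq: (T_c, d) sequence that produces the keys and values
    Returns the attended output (T_q, d) and the weights (T_q, T_c).
    """
    q = query_seq @ w_q
    k = context_seq @ w_k
    v = context_seq @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
d = 8  # hypothetical shared feature size


def random_proj():
    """Random (d, d) projection standing in for learned weights."""
    return rng.standard_normal((d, d)) / np.sqrt(d)


# Hypothetical modality embeddings: 10 and 6 time steps respectively.
eeg = rng.standard_normal((10, d))
gsr = rng.standard_normal((6, d))

# Cross-modal attention: each modality queries the other one.
eeg_reinforced, cross_w = attention(
    eeg, gsr, random_proj(), random_proj(), random_proj()
)
gsr_reinforced, _ = attention(
    gsr, eeg, random_proj(), random_proj(), random_proj()
)

# Assumed fusion step: concatenate the reinforced sequences along time.
fused = np.concatenate([eeg_reinforced, gsr_reinforced], axis=0)

# Self-attention over the fused representation.
fused_out, self_w = attention(
    fused, fused, random_proj(), random_proj(), random_proj()
)

print(cross_w.shape, fused_out.shape)
```

In the cross-modal step the queries come from one modality while the keys and values come from the other, so each row of `cross_w` is a distribution over the other modality's time steps. In the self-attention step all three come from the fused sequence.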