We present a systematic study of multimodal emotion recognition on the EAV dataset, investigating whether complex attention mechanisms improve performance on small datasets. We implement three model categories: baseline transformers (M1), novel factorized attention mechanisms (M2), and improved CNN baselines (M3). Our experiments show that sophisticated attention mechanisms consistently underperform on small datasets: M2 models scored 5 to 13 percentage points below baselines, owing to overfitting and the destruction of pretrained features. In contrast, simple domain-appropriate modifications proved effective: adding delta MFCCs to the audio CNN improved accuracy from 61.9% to 65.56% (+3.66pp), while frequency-domain features for EEG reached 67.62% (+7.62pp over the paper's baseline). Our vision transformer baseline (M1) achieved 75.30%, exceeding the paper's ViViT result (74.5%) through domain-specific pretraining, and vision delta features reached 72.68% (+1.28pp over the paper's CNN). These findings demonstrate that for small-scale emotion recognition, domain knowledge and careful implementation outperform architectural complexity.