United we stand, Divided we fall: Handling Weak Complementary Relationships for Audio-Visual Emotion Recognition in Valence-Arousal Space

Audio and visual modalities are the two predominant contact-free channels in videos, and they are often expected to complement each other. However, they do not always do so, and treating them as complementary when they are not results in poor audio-visual feature representations. In this paper, we introduce Gated Recursive Joint Cross Attention (GRJCA), which uses a gating mechanism to adaptively choose the most relevant features and thereby capture the synergic relationships across audio and visual modalities. Specifically, we improve the performance of Recursive Joint Cross-Attention (RJCA) by introducing a gating mechanism that controls the flow of information between the input features and the attended features of multiple iterations, depending on the strength of their complementary relationship. For instance, if the modalities exhibit a strong complementary relationship, the gating mechanism emphasizes the cross-attended features; otherwise, it emphasizes the non-attended features. To further improve the performance of the system, we also explore a hierarchical gating approach that introduces a gating mechanism at every iteration, followed by high-level gating across the gated outputs of each iteration. The proposed approach improves the performance of the RJCA model by adding more flexibility to deal with weak complementary relationships across audio and visual modalities. Extensive experiments are conducted on the challenging Affwild2 dataset to demonstrate the robustness of the proposed approach. By effectively handling the weak complementary relationships across the audio and visual modalities, the proposed model achieves a Concordance Correlation Coefficient (CCC) of 0.561 (0.623) for valence and 0.620 (0.660) for arousal on the test set (validation set).
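The gating idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `gate_features`, the scoring vector `w`, the softmax form of the gate, and the `temperature` parameter are all assumptions introduced here for illustration, and the random arrays merely stand in for real audio-visual features. The sketch shows how a per-time-step gate could weight the non-attended input features against the attended features from several iterations.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def gate_features(candidates, w, temperature=1.0):
    """Fuse K candidate feature sets with a soft gate (hypothetical form).

    candidates: array of shape (K, T, D), e.g. the non-attended input
                features followed by attended features of each iteration.
    w:          array of shape (D,), a stand-in for a learned scoring vector.
    Returns the fused features (T, D) and the gate weights (T, K).
    """
    # One scalar score per candidate and time step.
    scores = np.einsum("ktd,d->tk", candidates, w)
    # Gate weights are non-negative and sum to 1 across the K candidates.
    gates = softmax(scores / temperature, axis=-1)
    # Convex combination of the candidates at every time step.
    fused = np.einsum("tk,ktd->td", gates, candidates)
    return fused, gates


rng = np.random.default_rng(0)
T, D = 4, 8
x_input = rng.normal(size=(T, D))                          # non-attended features
x_attended = [rng.normal(size=(T, D)) for _ in range(2)]   # two attention iterations
candidates = np.stack([x_input, *x_attended])              # shape (3, T, D)
w = rng.normal(size=D)                                     # placeholder for learned weights

fused, gates = gate_features(candidates, w)
```

Under these assumptions, a gate weight near 1 on the first candidate corresponds to falling back on the non-attended features when the complementary relationship is weak. A hierarchical variant could, in principle, apply the same kind of gate once per iteration and then again across the per-iteration outputs.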
