This paper presents our approach to the Valence-Arousal (VA) estimation task in the ABAW6 competition. We preprocess video frames and audio segments and extract visual and audio features with pre-trained video and audio backbones. The features from the two modalities are fused, and Temporal Convolutional Network (TCN) modules encode the temporal and spatial correlations between them. A Transformer encoder then learns long-range temporal dependencies, improving the model's performance and generalization ability. Experimental results on the AffWild2 dataset show that our approach achieves competitive performance in VA estimation.
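The data flow described above (feature fusion, temporal convolution, then attention over time) can be illustrated with a toy pure-Python sketch. It is not the authors' implementation. The function names are hypothetical, and fusion by concatenation, causal left padding, and attention without learned projections are all assumptions made for illustration. The feature values stand in for backbone outputs.

```python
import math

def fuse(visual, audio):
    """Concatenate per-frame visual and audio vectors (assumed fusion scheme)."""
    if len(visual) != len(audio):
        raise ValueError("modalities must be aligned to the same number of frames")
    return [v + a for v, a in zip(visual, audio)]

def dilated_causal_conv(seq, kernels, dilation):
    """One dilated causal 1-D convolution over time, the core TCN operation.

    seq:     T frames, each a length-C feature vector
    kernels: kernels[o][k] is the length-C weight vector of output channel o, tap k
    Output frame t only sees inputs t, t - d, t - 2d, ... (zeros before t = 0).
    """
    out = []
    for t in range(len(seq)):
        frame = []
        for taps in kernels:
            acc = 0.0
            for k, w in enumerate(taps):
                idx = t - k * dilation
                if idx >= 0:
                    acc += sum(wi * xi for wi, xi in zip(w, seq[idx]))
            frame.append(acc)
        out.append(frame)
    return out

def receptive_field(kernel_size, dilations):
    """Number of input frames visible to one output of a stack of dilated convs."""
    return 1 + (kernel_size - 1) * sum(dilations)

def self_attention(seq):
    """Scaled dot-product self-attention with Q = K = V = seq (no learned weights)."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out

# Placeholder features for T = 6 frames: 2-dim "visual", 1-dim "audio".
visual = [[float(t), 1.0] for t in range(6)]
audio = [[0.5 * t] for t in range(6)]
fused = fuse(visual, audio)  # 6 frames x 3 features

# Two output channels, kernel size 2:
# channel 0 adds the first visual feature at t and t - d,
# channel 1 takes the audio difference between t and t - d.
kernels = [
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]],
]
hidden = dilated_causal_conv(fused, kernels, dilation=2)
context = self_attention(hidden)  # every frame attends to all frames
```

The sketch shows the division of labour between the two stages. A stack of dilated convolutions covers a window that grows with the sum of the dilations, for example `receptive_field(3, [1, 2, 4])` frames for three layers with kernel size 3. The attention step lets each frame draw on every other frame regardless of distance.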