We present an approach for efficient spatiotemporal feature extraction that combines MobileNetV4 with a multi-scale 3D MLP-Mixer-based temporal aggregation module. MobileNetV4, built from Universal Inverted Bottleneck (UIB) blocks, serves as the backbone and extracts hierarchical feature representations from input image sequences, providing rich semantic encoding at low computational cost. To capture temporal dependencies, we introduce a three-level MLP-Mixer module that processes the backbone's spatial features at multiple resolutions while preserving their structure. Experiments on the 8th ABAW competition show promising performance in affective behavior analysis. By pairing an efficient vision backbone with a structured temporal modeling mechanism, the proposed framework balances computational efficiency against predictive accuracy, making it well suited to real-time applications in mobile and embedded computing environments.
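To make the temporal aggregation idea concrete, the following is a minimal NumPy sketch of a three-level mixer applied to multi-resolution feature maps. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify tensor shapes, hidden sizes, how tokens are formed, or how the levels are fused. Here each level is assumed to flatten its time and space axes into one token axis, apply one standard MLP-Mixer block (token mixing followed by channel mixing), average over tokens, and concatenate the three results. The names `MixerBlock` and `aggregate`, the random stand-in feature maps, and all dimensions are hypothetical.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def layer_norm(x, eps=1e-5):
    # Normalize over the last (channel) axis; learned scale/shift omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def mlp(x, w1, w2):
    # Two-layer perceptron applied along the last axis (biases omitted).
    return gelu(x @ w1) @ w2


class MixerBlock:
    """One MLP-Mixer block over a (tokens, channels) matrix (hypothetical sizes)."""

    def __init__(self, num_tokens, channels, rng, hidden=32):
        scale = 0.02
        self.tok_w1 = rng.normal(0.0, scale, (num_tokens, hidden))
        self.tok_w2 = rng.normal(0.0, scale, (hidden, num_tokens))
        self.ch_w1 = rng.normal(0.0, scale, (channels, hidden))
        self.ch_w2 = rng.normal(0.0, scale, (hidden, channels))

    def __call__(self, x):
        # Token mixing: transpose so the MLP acts across the token axis.
        x = x + mlp(layer_norm(x).T, self.tok_w1, self.tok_w2).T
        # Channel mixing: the MLP acts across the channel axis of each token.
        x = x + mlp(layer_norm(x), self.ch_w1, self.ch_w2)
        return x


def aggregate(feature_maps, blocks):
    # Assumed fusion: mix each level, average over tokens, concatenate levels.
    pooled = []
    for fmap, block in zip(feature_maps, blocks):
        t, h, w, c = fmap.shape
        tokens = fmap.reshape(t * h * w, c)  # joint space-time tokens
        pooled.append(block(tokens).mean(axis=0))
    return np.concatenate(pooled)


rng = np.random.default_rng(0)
# Assumed (frames, height, width, channels) shapes for three backbone stages.
shapes = [(8, 8, 8, 16), (8, 4, 4, 32), (8, 2, 2, 64)]
feature_maps = [rng.normal(size=shape) for shape in shapes]  # stand-ins for backbone outputs
blocks = [MixerBlock(t * h * w, c, rng) for (t, h, w, c) in shapes]

clip_embedding = aggregate(feature_maps, blocks)
print(clip_embedding.shape)  # (112,)
```

In a real pipeline the random arrays would be replaced by intermediate MobileNetV4 feature maps for a clip, and the resulting clip embedding would feed a task head. Mixing over joint space-time tokens is only one reading of "3D"; factorized temporal and spatial mixing would be an equally plausible alternative.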