In this technical report, we introduce TempT, a novel method for test-time adaptation on videos that enforces temporal coherence of predictions across sequential frames. TempT is applicable to a broad range of video-based computer vision tasks, including facial expression recognition (FER). We evaluate TempT on the AffWild2 dataset as part of the Expression Classification Challenge at the 5th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW). Our approach uses only the visual modality of the data and a popular 2D CNN backbone, in contrast to larger sequential or attention-based models. Our experimental results show that TempT performs competitively with results reported in previous years of the challenge, providing a proof of concept for its use in real-world applications.
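To make the idea of temporal coherence concrete, the following is a minimal sketch of one way to measure how much a model's predictions change between consecutive frames. The abstract does not specify TempT's actual objective, so the function name `temporal_coherence_penalty`, the squared-difference form, and the toy logits are all assumptions made for illustration only.

```python
import math


def softmax(logits):
    """Convert one frame's logits into class probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_coherence_penalty(frame_logits):
    """Mean squared change in class probabilities between consecutive frames.

    Hypothetical stand-in for a temporal-coherence objective; the exact
    loss used by TempT is not given in the abstract.
    """
    probs = [softmax(logits) for logits in frame_logits]
    if len(probs) < 2:
        return 0.0  # a single frame has no temporal neighbours
    total = 0.0
    for prev, curr in zip(probs, probs[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total / (len(probs) - 1)


# Toy per-frame logits for three expression classes (made-up values).
steady = [[2.0, 0.1, 0.1]] * 4
flicker = [[2.0, 0.1, 0.1], [0.1, 2.0, 0.1]] * 2

print(temporal_coherence_penalty(steady))   # identical frames give zero penalty
print(temporal_coherence_penalty(flicker))  # alternating predictions give a positive penalty
```

In a test-time adaptation setting, a quantity of this kind could plausibly be minimized with respect to the model's parameters on unlabeled test video, since it requires no ground-truth labels. That usage is an assumption here rather than a description of the reported method.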