We introduce Temporal consistency for Test-time adaptation (TempT), a novel method for test-time adaptation on videos that uses the temporal coherence of predictions across sequential frames as a self-supervision signal. TempT has broad potential applications in computer vision tasks, including facial expression recognition (FER) in videos. We evaluate TempT on the AffWild2 dataset. Our approach uses only the visual modality of the data and a popular 2D CNN backbone, in contrast to the larger sequential or attention-based models used in other approaches. Our preliminary experimental results show that TempT is competitive with the performance reported in previous years, and its efficacy provides a compelling proof of concept for its use in various real-world applications.
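To make the idea of temporal coherence as a self-supervision signal concrete, the following is a minimal sketch, not the TempT objective itself. The abstract does not specify the loss, so the squared-difference penalty on consecutive-frame class probabilities, the function names `softmax` and `temporal_consistency_loss`, and the toy logits are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert one frame's logits to class probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_consistency_loss(frame_logits: Sequence[Sequence[float]]) -> float:
    """Hypothetical smoothness penalty (an assumption, not the paper's loss):
    mean squared difference between the class probabilities of consecutive frames.
    """
    probs = [softmax(frame) for frame in frame_logits]
    if len(probs) < 2:
        return 0.0  # a single frame has no temporal neighbours to compare with
    total = 0.0
    for prev, curr in zip(probs, probs[1:]):
        total += sum((p - c) ** 2 for p, c in zip(prev, curr))
    return total / (len(probs) - 1)


# Toy per-frame logits for a 3-class problem (illustrative values only).
smooth = [[2.0, 0.1, 0.1], [2.1, 0.0, 0.1], [1.9, 0.2, 0.1]]
jittery = [[2.0, 0.1, 0.1], [0.1, 2.0, 0.1], [2.0, 0.1, 0.1]]

print(temporal_consistency_loss(smooth))
print(temporal_consistency_loss(jittery))
```

Predictions that flicker between classes from frame to frame receive a larger penalty than predictions that vary smoothly. In a test-time adaptation setting, a signal of this kind could be minimized with respect to the model parameters on unlabeled test videos; this sketch only computes the signal and does not perform any adaptation.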