Curriculum-Based Strategies for Efficient Cross-Domain Action Recognition

Despite significant progress in human action recognition, generalizing to diverse viewpoints remains a challenge. Most existing datasets are captured from ground-level perspectives, and models trained on them often struggle to transfer to drastically different domains such as aerial views. This paper examines how curriculum-based training strategies can improve generalization to unseen real aerial-view data without using any real aerial data during training. We explore curriculum learning for cross-view action recognition using two out-of-domain sources: synthetic aerial-view data and real ground-view data. We first evaluate the order of training (fine-tuning on synthetic aerial data vs. fine-tuning on real ground data). The two curriculum strategies we study both finish by fine-tuning on real ground data but differ in how they transition from synthetic to real. The first uses a two-stage curriculum with direct fine-tuning, while the second applies a progressive curriculum that expands the dataset in multiple stages before fine-tuning. We evaluate both methods on the REMAG dataset using SlowFast (CNN-based) and MViTv2 (Transformer-based) architectures. Results show that combining the two out-of-domain datasets clearly outperforms training on a single domain, whether real ground-view or synthetic aerial-view. Both curriculum strategies match the top-1 accuracy of simple dataset combination while offering efficiency gains. With the two-step fine-tuning method, SlowFast achieves up to a 37% reduction in training iterations and MViTv2 up to a 30% reduction, compared to simple combination. The multi-step progressive approach reduces iterations further, by up to 9% for SlowFast and up to 30% for MViTv2, relative to the two-step method. These findings demonstrate that curriculum-based training can maintain comparable performance (top-1 accuracy within a 3% range) while improving training efficiency in cross-view action recognition.
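To make the difference between the two curricula concrete, the following is a minimal sketch of how their stage schedules might be laid out. It is not the paper's implementation: the function names, the stage labels, the `num_expansions` parameter, and in particular the rule that each progressive stage adds a growing share of real ground-view samples to the synthetic set are all assumptions made for illustration, since the abstract does not specify how the dataset is expanded.

```python
from typing import List, Sequence, Tuple

# A stage is (stage name, sample ids trained on during that stage).
Stage = Tuple[str, List[str]]


def two_stage_curriculum(synthetic: Sequence[str], real: Sequence[str]) -> List[Stage]:
    """Train on synthetic aerial data, then fine-tune directly on real ground data."""
    return [
        ("synthetic_aerial", list(synthetic)),
        ("finetune_real_ground", list(real)),
    ]


def progressive_curriculum(
    synthetic: Sequence[str], real: Sequence[str], num_expansions: int = 3
) -> List[Stage]:
    """Expand the training set over several stages, then fine-tune on real ground data.

    Assumption (not from the paper): stage k trains on all synthetic samples
    plus the first k/num_expansions share of the real ground-view samples.
    """
    if num_expansions < 1:
        raise ValueError("num_expansions must be at least 1")
    stages: List[Stage] = []
    for k in range(num_expansions):
        cutoff = len(real) * k // num_expansions  # stage 0 is synthetic only
        stages.append((f"expand_{k}", list(synthetic) + list(real[:cutoff])))
    stages.append(("finetune_real_ground", list(real)))
    return stages


if __name__ == "__main__":
    synthetic_ids = ["s0", "s1"]       # hypothetical synthetic aerial clips
    real_ids = ["r0", "r1", "r2"]      # hypothetical real ground-view clips
    for name, samples in progressive_curriculum(synthetic_ids, real_ids):
        print(name, samples)
```

In both schedules the last stage uses only real ground-view data, matching the abstract's description that the two strategies share the final fine-tuning step and differ only in the transition leading up to it.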
