Synthetic-to-Real Domain Adaptation for Action Recognition: A Dataset and Baseline Performances

Human action recognition is a challenging problem, particularly when there is high variability in factors such as subject appearance, background, and viewpoint. While deep neural networks (DNNs) have been shown to perform well on action recognition tasks, they typically require large amounts of high-quality labeled data to achieve robust performance across a variety of conditions. Synthetic data has shown promise as a way to avoid the substantial costs and potential ethical concerns associated with collecting and labeling enormous amounts of data in the real world. However, synthetic data may differ from real data in important ways. This phenomenon, known as \textit{domain shift}, can limit the utility of synthetic data in robotics applications. To mitigate the effects of domain shift, substantial effort is being dedicated to the development of domain adaptation (DA) techniques. Yet much remains to be understood about how best to develop these techniques. In this paper, we introduce a new dataset called Robot Control Gestures (RoCoG-v2). The dataset is composed of both real and synthetic videos from seven gesture classes, and is intended to support the study of synthetic-to-real domain shift for video-based action recognition. Our work expands upon existing datasets by focusing the action classes on gestures for human-robot teaming, as well as by enabling investigation of domain shift in both ground and aerial views. We present baseline results using state-of-the-art action recognition and domain adaptation algorithms and offer initial insights into tackling the synthetic-to-real and ground-to-air domain shifts.
