We study domain adaptation for action recognition, namely domain adaptive action recognition, which aims to effectively transfer action recognition ability from a label-sufficient source domain to a label-free target domain. Since actions are performed by humans, exploiting human cues in videos is crucial when recognizing actions across domains. However, existing methods are prone to losing human cues, instead exploiting the correlation between non-human contexts and the associated actions; such contexts, being agnostic to the actions themselves, degrade recognition performance in the target domain. To overcome this problem, we focus on uncovering human-centric action cues for domain adaptive action recognition, investigating two aspects of such cues: human cues and human-context interaction cues. Accordingly, our proposed Human-Centric Transformer (HCTransformer) develops a decoupled human-centric learning paradigm that explicitly concentrates on human-centric action cues during domain-invariant video feature learning. HCTransformer first conducts human-aware temporal modeling with a human encoder, avoiding the loss of human cues during domain-invariant video feature learning. Then, with a Transformer-like architecture, HCTransformer exploits domain-invariant, action-correlated contexts via a context encoder, and further models the domain-invariant interaction between humans and these action-correlated contexts. We conduct extensive experiments on three benchmarks, UCF-HMDB, Kinetics-NEC-Drone, and EPIC-Kitchens-UDA, and the state-of-the-art results demonstrate the effectiveness of the proposed HCTransformer.
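The decoupled design described above — a human encoder for human-aware temporal modeling, a context encoder for action-correlated contexts, and a module for human-context interaction — can be sketched in PyTorch as follows. This is a minimal illustration only: the layer sizes, the use of cross-attention for the interaction step, mean pooling, and the class count are all assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class HumanCentricSketch(nn.Module):
    """Hypothetical sketch of the decoupled two-encoder paradigm in the
    abstract; all hyperparameters and the fusion form are assumptions."""

    def __init__(self, dim=256, heads=4, layers=2, num_classes=10):
        super().__init__()
        # Human encoder: temporal modeling over per-frame human tokens.
        self.human_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=layers)
        # Context encoder: models action-correlated context tokens.
        self.context_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=layers)
        # Cross-attention as one plausible way to model the
        # human-context interaction (assumed, not from the paper).
        self.interaction = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)  # placeholder head

    def forward(self, human_tokens, context_tokens):
        h = self.human_encoder(human_tokens)        # (B, T, D)
        c = self.context_encoder(context_tokens)    # (B, N, D)
        # Human tokens attend to context tokens -> interaction features.
        x, _ = self.interaction(h, c, c)            # (B, T, D)
        return self.classifier(x.mean(dim=1))       # (B, num_classes)
```

In this sketch the two encoders keep human and context streams separate until the interaction step, mirroring the "decoupled" learning paradigm; the domain-adaptation losses that make the learned features domain-invariant are omitted.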