With the surge in attention to Egocentric Hand-Object Interaction (Ego-HOI), large-scale datasets such as Ego4D and EPIC-KITCHENS have been proposed. However, most current research is built on resources derived from third-person video action recognition. The inherent domain gap between first- and third-person action videos, which has not been adequately addressed before, makes current Ego-HOI learning suboptimal. This paper rethinks the problem and proposes a new framework as an infrastructure to advance Ego-HOI recognition by Probing, Curation and Adaption (EgoPCA). We contribute comprehensive pre-train sets, balanced test sets, and a new baseline, complete with a training-finetuning strategy. With our new framework, we not only achieve state-of-the-art performance on Ego-HOI benchmarks but also build several new and effective mechanisms and settings to advance further research. We believe our data and findings will pave a new way for Ego-HOI understanding. Code and data are available at https://mvig-rhos.com/ego_pca