In this study, we investigate the effectiveness of synthetic data in enhancing hand-object interaction detection within the egocentric vision domain. We introduce a simulator that generates synthetic images of hand-object interactions, automatically labeled with hand-object contact states, bounding boxes, and pixel-wise segmentation masks. Through comprehensive experiments and comparative analyses on three egocentric datasets, VISOR, EgoHOS, and ENIGMA-51, we demonstrate that combining synthetic data with domain adaptation techniques achieves performance comparable to conventional supervised methods while requiring annotations on only a fraction of the real data. When tested with in-domain synthetic data generated from 3D models of real target environments and objects, our best models show consistent performance improvements over standard fully supervised approaches based on labeled real data only. Our study also sets a new benchmark of domain adaptation for egocentric hand-object interaction detection (HOI-Synth) and provides baseline results to encourage the community to engage in this challenging task. We release the generated data, code, and the simulator at the following link: https://iplab.dmi.unict.it/HOI-Synth/.