Understanding hands and the objects they interact with, both directly and through tools, is a key step for tasks ranging from action perception to 3D reconstruction and robotics. This paper makes three contributions to the Hand-Object Interaction (HOI) understanding literature: (1) HOI-DETR, a new framework that introduces hand-object and object-object interactions into the Co-DETR architecture and achieves state-of-the-art results; (2) a comprehensive HOI evaluation suite of four diverse datasets, including a video benchmark derived from the HD-EPIC dataset and fresh annotations that improve the Hands23 benchmark; and (3) a trained checkpoint that significantly improves the state of the art across Hands23, HOIST, FineBio, and HD-EPIC, including mAP gains of over 20 percentage points on Hands23 and FineBio. Our ablations confirm the contribution of each model component.