Understanding human interaction with objects is an important research topic for embodied artificial intelligence, and identifying the objects that humans are interacting with is a primary problem for interaction understanding. Existing methods rely on frame-based detectors to locate interacting objects. However, this approach is susceptible to heavy occlusion, background clutter, and distracting objects. To address these limitations, we propose to leverage the spatio-temporal information of hand-object interaction to track interacting objects in these challenging cases. Unlike general object tracking, we have no prior knowledge of the objects to be tracked, so we first use the spatial relation between hands and objects to adaptively discover the interacting objects in the scene. Second, we exploit the consistency and continuity of object appearance across successive frames to track the objects. With this tracking formulation, our method also benefits from training on large-scale general object-tracking datasets. We further curate a video-level hand-object interaction dataset from 100DOH for testing and evaluation. Quantitative results demonstrate that the proposed method outperforms state-of-the-art methods. Specifically, in scenes with continuous interaction with different objects, we achieve an improvement of about 10% in Average Precision (AP). Qualitative results also show that our method produces more continuous trajectories for interacting objects.
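The two-step idea in the abstract (discover the interacting object from its spatial relation to the hand, then follow it across frames) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `(x1, y1, x2, y2)` box format, the `min_iou` threshold, and the use of box overlap as a stand-in for appearance consistency are all assumptions made for illustration.

```python
from math import hypot


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def center(box):
    """Center point of a box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def discover_interacting_object(hand_box, candidate_boxes):
    """Hypothetical discovery step: pick the candidate that overlaps the
    hand most, breaking ties by the smaller center distance."""
    if not candidate_boxes:
        return None
    hx, hy = center(hand_box)

    def score(box):
        cx, cy = center(box)
        return (iou(hand_box, box), -hypot(cx - hx, cy - hy))

    return max(candidate_boxes, key=score)


def track(initial_box, per_frame_candidates, min_iou=0.3):
    """Hypothetical tracking step: greedy frame-to-frame association.
    Box overlap replaces the appearance cue described in the abstract.
    If no candidate overlaps enough (e.g. occlusion), the previous box
    is carried forward so the trajectory stays continuous."""
    trajectory = [initial_box]
    for candidates in per_frame_candidates:
        prev = trajectory[-1]
        best = max(candidates, key=lambda b: iou(prev, b), default=None)
        if best is None or iou(prev, best) < min_iou:
            best = prev
        trajectory.append(best)
    return trajectory
```

Under these assumptions, `discover_interacting_object` would run once on the first frame where a hand is detected, and `track` would then extend the trajectory through later frames, including frames where a per-frame detector finds no suitable box.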