Accurately detecting active objects undergoing state changes is essential for understanding human interactions and supporting decision-making. Existing methods for active object detection (AOD) rely primarily on the visual appearance of objects in the input, such as changes in size, shape, and relationship with hands. However, these visual changes can be subtle, which makes detection difficult, particularly in scenes containing multiple distracting, unchanged instances of the same category. We observe that a state change is often the result of an interaction performed on the object, and therefore propose using informed priors about plausible object-related interactions (including their semantics and visual appearance) to provide more reliable cues for AOD. Specifically, we propose a knowledge aggregation procedure that integrates these priors into oracle queries within the teacher decoder, supplying additional commonsense knowledge about object affordances to help locate the active object. To streamline inference and reduce the need for extra knowledge inputs, we propose a knowledge distillation approach that encourages the student decoder to mimic the detection capability of the oracle-query-driven teacher decoder by replicating its predictions and attention. Our framework achieves state-of-the-art performance on four datasets (Ego4D, Epic-Kitchens, MECCANO, and 100DOH), demonstrating its effectiveness in improving AOD.
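The distillation step described in the abstract, in which the student decoder replicates the teacher decoder's predictions and attention, can be illustrated with a minimal sketch. The abstract does not specify the loss terms, so everything here is an assumption made for illustration: a temperature-softened KL divergence for the predictions, a mean squared error for the attention maps, and the hypothetical names `distillation_loss`, `temperature`, and `attn_weight`. It is not the authors' implementation.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax over one query's class logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def attention_mse(teacher_attn: Sequence[Sequence[float]],
                  student_attn: Sequence[Sequence[float]]) -> float:
    """Mean squared error between two attention maps of identical shape."""
    squared = [
        (t - s) ** 2
        for t_row, s_row in zip(teacher_attn, student_attn)
        for t, s in zip(t_row, s_row)
    ]
    return sum(squared) / len(squared)


def distillation_loss(teacher_logits: Sequence[Sequence[float]],
                      student_logits: Sequence[Sequence[float]],
                      teacher_attn: Sequence[Sequence[float]],
                      student_attn: Sequence[Sequence[float]],
                      temperature: float = 2.0,   # assumed value, not from the paper
                      attn_weight: float = 1.0) -> float:  # assumed value
    """Assumed combined objective: prediction mimicry plus attention mimicry.

    Each row of the logits is one decoder query; each row of the attention
    maps is one query's attention weights over image locations.
    """
    pred_terms = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    pred_loss = sum(pred_terms) / len(pred_terms)
    return pred_loss + attn_weight * attention_mse(teacher_attn, student_attn)


if __name__ == "__main__":
    # Toy values: two queries, two classes, three attended locations.
    teacher_logits = [[2.0, 0.5], [0.1, 1.5]]
    student_logits = [[1.0, 1.0], [0.3, 0.9]]
    teacher_attn = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
    student_attn = [[0.4, 0.4, 0.2], [0.3, 0.5, 0.2]]
    print(distillation_loss(teacher_logits, student_logits, teacher_attn, student_attn))
```

In this sketch the loss is zero when the student reproduces the teacher exactly and grows as their predictions or attention maps diverge. Under the framework described above, only the teacher would receive the oracle queries built from the interaction priors, so the student could run at inference time without those inputs.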