Recent one-stage transformer-based methods achieve notable gains in the human-object interaction (HOI) detection task by leveraging the detection capability of DETR. However, current methods redirect the detection target of the object decoder, and the box target is not explicitly separated from the query embeddings, which leads to long and difficult training. Furthermore, matching predicted HOI instances with the ground truth is more challenging than in object detection, so simply adapting training strategies from object detection makes training even more difficult. To clear the ambiguity between human and object detection and share the prediction burden, we propose a novel one-stage framework (SOV), which consists of a subject decoder, an object decoder, and a verb decoder. Moreover, we propose a novel Specific Target Guided (STG) DeNoising strategy, which leverages learnable object and verb label embeddings to guide training and accelerate convergence. In addition, at inference, label-specific information is fed directly into the decoders by initializing the query embeddings from the learnable label embeddings. Without additional features or prior language knowledge, our method (SOV-STG) achieves higher accuracy than the state-of-the-art method in one-third of the training epochs. The code is available at \url{https://github.com/cjw2021/SOV-STG}.
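The abstract states that learnable label embeddings are used in two ways: to build denoising queries during training and to initialize the query embeddings at inference. The following is a minimal sketch of that general idea and not the authors' implementation, which is in the linked repository. The function names, the weighted-sum mixing scheme, and the label-flipping noise are all assumptions made for illustration. Plain Python lists stand in for learnable tensors.

```python
import random


def make_label_embeddings(num_labels, dim, rng):
    """Create a table of label embeddings (random stand-ins for learned weights)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_labels)]


def init_queries(label_embeddings, mix_weights):
    """Build one query per row of mix_weights as a weighted sum of label embeddings.

    Assumption: the mixing-by-weighted-sum scheme is hypothetical; the abstract
    only says queries are initialized from the label embeddings.
    """
    dim = len(label_embeddings[0])
    queries = []
    for weights in mix_weights:
        query = [0.0] * dim
        for weight, embedding in zip(weights, label_embeddings):
            for d in range(dim):
                query[d] += weight * embedding[d]
        queries.append(query)
    return queries


def denoising_queries(gt_labels, label_embeddings, flip_prob, rng):
    """Look up an embedding for each ground-truth label, with optional label noise.

    Assumption: noise is modeled as replacing a label with a uniformly random
    one with probability flip_prob; the source does not specify the noise.
    """
    num_labels = len(label_embeddings)
    noised_labels = []
    for label in gt_labels:
        if rng.random() < flip_prob:
            label = rng.randrange(num_labels)
        noised_labels.append(label)
    queries = [list(label_embeddings[label]) for label in noised_labels]
    return noised_labels, queries


if __name__ == "__main__":
    rng = random.Random(0)
    object_embeddings = make_label_embeddings(num_labels=4, dim=3, rng=rng)
    # Inference-style initialization: two queries mixed from the label table.
    queries = init_queries(object_embeddings, [[0, 1, 0, 0], [0.5, 0.5, 0, 0]])
    # Training-style denoising queries for two ground-truth object labels.
    labels, dn_queries = denoising_queries([2, 0], object_embeddings, 0.3, rng)
    print(len(queries), labels, len(dn_queries))
```

In this sketch the same embedding table serves both paths, which mirrors the abstract's point that label-specific information guides training and is also fed into the decoders at inference. A separate table for verb labels could be handled the same way.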