Human-Object Interaction (HOI) detection is the task of localizing humans and objects in an image and predicting the interactions in human-object pairs. In real-world scenarios, HOI detection models need systematic generalization, i.e., generalization to novel combinations of objects and interactions, because the training data are expected to cover only a limited portion of all possible combinations. However, to our knowledge, no open benchmarks or previous work exist for evaluating the systematic generalization performance of HOI detection models. To address this issue, we created two new sets of HOI detection data splits, named HICO-DET-SG and V-COCO-SG, based on the HICO-DET and V-COCO datasets, respectively. When evaluated on the new data splits, representative HOI detection models performed much more poorly than when evaluated on the original splits. This reveals that systematic generalization is a challenging goal in HOI detection. By analyzing the evaluation results, we also gained insights for improving systematic generalization performance and identified four possible future research directions. We hope that our new data splits and the presented analysis will encourage further research on systematic generalization in HOI detection.
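To make the notion of a systematic-generalization split concrete, the following is a minimal sketch in Python of one way such a split could be built: a subset of object-interaction combinations is held out, and no training image may contain a held-out combination. This is an illustration under assumed inputs, not the procedure used to construct HICO-DET-SG or V-COCO-SG. The function name `split_by_combination`, the `images` mapping, and the rule of dropping images that mix held-out and non-held-out combinations are all hypothetical choices.

```python
import random


def split_by_combination(images, test_fraction=0.3, seed=0):
    """Split image ids so that test combinations never occur in training.

    images: dict mapping image_id -> set of (object_class, interaction_class)
            pairs annotated in that image (hypothetical input format).
    Returns (train_ids, test_ids, held_out_combinations).
    """
    rng = random.Random(seed)

    # Collect every distinct object-interaction combination in the data.
    combos = sorted({c for pairs in images.values() for c in pairs})
    rng.shuffle(combos)

    # Hold out a fraction of the combinations for testing.
    n_test = max(1, int(len(combos) * test_fraction))
    held_out = set(combos[:n_test])

    train_ids, test_ids = [], []
    for image_id, pairs in sorted(images.items()):
        if pairs.isdisjoint(held_out):
            # No held-out combination appears: safe for training.
            train_ids.append(image_id)
        elif pairs <= held_out:
            # Only held-out combinations appear: use for testing.
            test_ids.append(image_id)
        # Images mixing both kinds are dropped (a simplifying assumption).
    return train_ids, test_ids, held_out
```

This sketch does not check that each object class and each interaction class still appears somewhere in the training set; a practical benchmark would likely need such additional constraints so that only the combinations, not the individual classes, are novel at test time.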