Human-object interaction (HOI) detectors built on the popular query-based transformer architecture have achieved promising performance. However, they still struggle to accurately identify uncommon visual patterns and to distinguish ambiguous HOIs. We observe that these difficulties may arise from the limited capacity of conventional detector queries to represent diverse intra-category patterns and inter-category dependencies. To address this, we introduce the Interaction Prompt Distribution Learning (InterProDa) approach. InterProDa learns multiple sets of soft prompts and estimates a distribution for each category from these prompts. It then conditions HOI queries on the category distributions, enabling them to represent near-infinite intra-category dynamics and universal cross-category relationships. Our InterProDa detector achieves competitive performance on the HICO-DET and V-COCO benchmarks. Moreover, our method can be integrated into most transformer-based HOI detectors, significantly improving their performance with minimal additional parameters.
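To make the described mechanism concrete, below is a minimal PyTorch-style sketch of the idea, not the paper's actual implementation. All names (`SoftPromptDistribution`, `condition_queries`, `num_prompts`) are hypothetical, and modeling each category as a Gaussian estimated from its prompt set, with a reparameterized sample added to the queries, is one plausible reading of "estimating category distributions from various prompts".

```python
# A minimal sketch under the assumptions stated above; names are hypothetical.
import torch
import torch.nn as nn


class SoftPromptDistribution(nn.Module):
    """Learns several soft-prompt vectors per HOI category and summarizes
    them as a per-category Gaussian (mean/std over the prompt set)."""

    def __init__(self, num_categories: int, num_prompts: int, dim: int):
        super().__init__()
        # Multiple learnable soft prompts for each category.
        self.prompts = nn.Parameter(
            torch.randn(num_categories, num_prompts, dim) * 0.02
        )

    def forward(self):
        # Estimate each category's distribution from its prompt set.
        mean = self.prompts.mean(dim=1)       # (C, D)
        std = self.prompts.std(dim=1) + 1e-6  # (C, D), avoid zero variance
        return mean, std


def condition_queries(queries, mean, std):
    """Incorporate category distributions into HOI queries by adding a
    reparameterized sample, so queries can cover intra-category variation."""
    # One sample per category via the reparameterization trick.
    sample = mean + std * torch.randn_like(std)  # (C, D)
    # Pool the category signal and broadcast-add it to every query.
    return queries + sample.mean(dim=0)          # (Q, D)


if __name__ == "__main__":
    # 117 is the number of verb categories in HICO-DET.
    dist = SoftPromptDistribution(num_categories=117, num_prompts=8, dim=256)
    queries = torch.randn(64, 256)  # 64 HOI decoder queries
    mean, std = dist()
    conditioned = condition_queries(queries, mean, std)
    print(conditioned.shape)        # torch.Size([64, 256])
```

Because sampling draws a different point from each category distribution at every forward pass, the same set of queries can, in principle, represent many intra-category variants, which matches the intuition stated in the abstract.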