We propose an agglomerative Transformer (AGER) that, for the first time, enables Transformer-based human-object interaction (HOI) detectors to flexibly exploit extra instance-level cues in a single-stage, end-to-end manner. AGER acquires instance tokens by dynamically clustering patch tokens and aligning the cluster centers to instances with textual guidance, which brings two benefits: 1) Integrality: each instance token is encouraged to contain all discriminative feature regions of an instance, which significantly improves the extraction of different instance-level cues and in turn leads to a new state-of-the-art HOI detection performance of 36.75 mAP on HICO-Det. 2) Efficiency: the dynamic clustering mechanism allows AGER to generate instance tokens jointly with the feature learning of the Transformer encoder, eliminating the need for the additional object detector or instance decoder used in prior methods, so that the desired extra cues for HOI detection can be extracted in a single-stage, end-to-end pipeline. Concretely, AGER reduces GFLOPs by 8.5% and improves FPS by 36%, even compared with a vanilla DETR-like pipeline without extra cue extraction.
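To make the idea of clustering patch tokens into instance tokens more concrete, the following is a minimal sketch of a single soft clustering step. It is not AGER's actual mechanism, which the abstract does not specify: the function name `soft_cluster_step`, the dot-product similarity, the `temperature` parameter, and the toy inputs are all assumptions chosen for illustration, and the textual-guidance alignment is omitted entirely.

```python
import math

def soft_cluster_step(patch_tokens, centers, temperature=1.0):
    """One soft clustering step (hypothetical helper, not the paper's method).

    patch_tokens: list of N vectors, centers: list of K vectors.
    Returns (new_centers, assignments), where assignments[n][k] is the
    soft weight of patch token n on cluster k.
    """
    assignments = []
    for token in patch_tokens:
        # Dot-product similarity between this patch token and every center.
        logits = [sum(t * c for t, c in zip(token, center)) / temperature
                  for center in centers]
        # Numerically stable softmax over the centers.
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        assignments.append([e / total for e in exps])

    dim = len(patch_tokens[0])
    new_centers = []
    for k in range(len(centers)):
        mass = sum(row[k] for row in assignments)
        # New center k is the assignment-weighted mean of all patch tokens.
        new_centers.append([
            sum(row[k] * token[d] for row, token in zip(assignments, patch_tokens)) / mass
            for d in range(dim)
        ])
    return new_centers, assignments

# Toy input (assumed): four 2-D "patch tokens" forming two obvious groups.
patch_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centers = [[1.0, 0.0], [0.0, 1.0]]
instance_tokens, assignments = soft_cluster_step(patch_tokens, centers, temperature=0.1)
```

In this sketch each updated center pools features from every patch that is softly assigned to it, which loosely mirrors the stated goal that an instance token should cover all discriminative regions of an instance. In a real model the step would presumably be repeated across encoder layers and learned end to end.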