Visual Relationship Detection (VRD) has recently seen significant advances with Transformer-based architectures. However, we identify two key limitations in the conventional label assignment used to train Transformer-based VRD models, i.e., the process of mapping each ground-truth (GT) to a prediction. First, queries remain unspecialized: because every query is expected to detect every relation, it is difficult for a query to specialize in specific relations. Second, queries are insufficiently trained: because each GT is assigned to only a single prediction, near-correct or even correct predictions are suppressed by being assigned "no relation" as their GT. To address these issues, we propose Groupwise Query Specialization and Quality-Aware Multi-Assignment (SpeaQ). Groupwise Query Specialization trains specialized queries by dividing queries and relations into disjoint groups and directing each query in a given query group solely toward relations in the corresponding relation group. Quality-Aware Multi-Assignment further facilitates training by assigning a GT to multiple predictions that closely match the GT in terms of the subject, the object, and the relation between them. Experimental results and analyses show that SpeaQ effectively trains specialized queries, which better utilize the capacity of a model, resulting in consistent performance gains with zero additional inference cost across multiple VRD models and benchmarks. Code is available at https://github.com/mlvlab/SpeaQ.
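To make the two ideas concrete, the following is a minimal illustrative sketch and not the released SpeaQ implementation. It assumes a hypothetical per-pair quality score in [0, 1] that summarizes agreement on subject, object, and relation, and it replaces the usual optimal matching step with a simple greedy pass. All function names, the threshold, and the per-GT cap are assumptions made for illustration.

```python
import numpy as np

def group_mask(query_groups, gt_relations, relation_to_group):
    """Boolean mask of shape [num_queries, num_gts].

    Entry (q, g) is True when query q belongs to the group that
    owns the relation class of ground-truth g (hypothetical grouping).
    """
    gt_groups = relation_to_group[gt_relations]  # group id of each GT
    return query_groups[:, None] == gt_groups[None, :]

def multi_assign(quality, allowed, threshold=0.5, max_per_gt=2):
    """Greedy stand-in for quality-aware multi-assignment (assumption).

    Each GT may be assigned to up to `max_per_gt` allowed queries whose
    quality is at least `threshold`; each query receives at most one GT.
    Returns a list of (query_index, gt_index) pairs.
    """
    scores = np.where(allowed, quality, -np.inf)  # forbid cross-group pairs
    order = np.argsort(-scores, axis=None, kind="stable")  # best pairs first
    used_queries = set()
    per_gt = np.zeros(scores.shape[1], dtype=int)
    pairs = []
    for flat in order:
        q, g = (int(i) for i in np.unravel_index(flat, scores.shape))
        if scores[q, g] < threshold:
            break  # every remaining pair is of lower quality
        if q in used_queries or per_gt[g] >= max_per_gt:
            continue
        used_queries.add(q)
        per_gt[g] += 1
        pairs.append((q, g))
    return pairs

# Toy example: 4 queries in 2 groups, 3 relation classes, 2 GTs.
query_groups = np.array([0, 0, 1, 1])
relation_to_group = np.array([0, 0, 1])  # classes 0 and 1 -> group 0, class 2 -> group 1
gt_relations = np.array([1, 2])
quality = np.array([[0.9, 0.95],
                    [0.7, 0.10],
                    [0.8, 0.60],
                    [0.2, 0.30]])

allowed = group_mask(query_groups, gt_relations, relation_to_group)
pairs = multi_assign(quality, allowed)
```

In this toy setup, query 0 scores highest on the second GT, but the group mask prevents that match, so the second GT can only go to a query in its own group. Queries left unassigned would be trained toward "no relation", as in the conventional assignment.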