Fashion item detection is challenging due to the ambiguities introduced by the highly diverse appearances of fashion items and the similarities among item subcategories. To address this challenge, we propose a novel Holistic Detection Transformer (Holi-DETR) that detects fashion items in outfit images holistically by leveraging contextual information. Fashion items often have meaningful relationships because they are combined to create specific styles. Unlike conventional detectors that detect each item independently, Holi-DETR detects multiple items jointly and reduces ambiguities by leveraging three distinct types of contextual information: (1) the co-occurrence relationship between fashion items, (2) the relative position and size based on inter-item spatial arrangements, and (3) the spatial relationships between items and human body keypoints. To this end, we propose a novel architecture that integrates these three types of heterogeneous contextual information into the Detection Transformer (DETR) and its subsequent models. In experiments, the proposed method improved the performance of the vanilla DETR and the more recently developed Co-DETR by 3.6 percentage points (pp) and 1.1 pp, respectively, in terms of average precision (AP).
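The abstract does not specify how the contextual signals are computed, so the following is only a minimal sketch of what the first two could look like as raw statistics: a conditional co-occurrence estimate over annotated outfits, and a relative position and size descriptor between two boxes. The function names, the `(cx, cy, w, h)` box format, and the normalization by the reference box are all assumptions made for illustration, not the formulation used in Holi-DETR.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_probability(outfits):
    """Estimate P(b present | a present) from per-image category label lists.

    Hypothetical helper: `outfits` is a list of label lists, one per image.
    """
    item_counts = Counter()
    pair_counts = Counter()
    for labels in outfits:
        unique = set(labels)  # count each category once per image
        item_counts.update(unique)
        for a, b in combinations(sorted(unique), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    # Pairs that never co-occur are simply absent from the result.
    return {(a, b): n / item_counts[a] for (a, b), n in pair_counts.items()}


def relative_geometry(box_a, box_b):
    """Describe box_b relative to box_a; boxes are (cx, cy, w, h) (assumed format).

    Returns (dx, dy, w_ratio, h_ratio), with offsets scaled by box_a's size.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return ((bx - ax) / aw, (by - ay) / ah, bw / aw, bh / ah)


if __name__ == "__main__":
    outfits = [["shirt", "jeans"], ["shirt", "skirt"], ["shirt", "jeans", "sneakers"]]
    probs = cooccurrence_probability(outfits)
    print(probs[("jeans", "shirt")])  # shirt appears in every image that has jeans
    print(relative_geometry((0.5, 0.25, 0.5, 0.25), (0.5, 0.75, 0.5, 0.5)))
```

In a detector, statistics like these would typically be embedded and fed to the model as additional features or attention biases; how Holi-DETR does this is described in the paper itself, not in this sketch.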