Transformer-based detection and segmentation methods use a list of learned detection queries to retrieve information from the transformer network, and learn to predict the location and category of one specific object from each query. We empirically find that random convex combinations of the learned queries still work well with the corresponding models. We then propose to learn a convex combination with dynamic coefficients based on the high-level semantics of the image. The generated dynamic queries, named modulated queries, better capture the prior of object locations and categories in different images. Equipped with our modulated queries, a wide range of DETR-based models achieve consistently superior performance across multiple tasks, including object detection, instance segmentation, panoptic segmentation, and video instance segmentation.
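To make the notion of a convex combination of queries concrete, the following is a minimal sketch in plain Python and not the authors' implementation. It assumes the coefficients are obtained by a softmax over logits, which guarantees they are non-negative and sum to one; the abstract does not state how the coefficients are produced. The names `softmax`, `convex_combine`, `modulated_queries`, and the projection table `proj` are hypothetical and introduced only for illustration.

```python
import math
import random


def softmax(logits):
    """Map arbitrary real logits to non-negative weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def convex_combine(queries, weights):
    """Return sum_i weights[i] * queries[i] for a valid convex weight vector."""
    assert len(queries) == len(weights)
    assert all(w >= 0 for w in weights)
    assert abs(sum(weights) - 1.0) < 1e-9
    dim = len(queries[0])
    return [sum(w * q[j] for w, q in zip(weights, queries)) for j in range(dim)]


def modulated_queries(queries, image_feature, proj):
    """Build one output query per entry of `proj` (hypothetical parameters).

    proj[k][i] is a weight vector with the same length as `image_feature`;
    its dot product with the feature is the logit of base query i for
    output query k. A softmax over i yields the dynamic coefficients.
    """
    out = []
    for rows in proj:
        logits = [sum(a * b for a, b in zip(row, image_feature)) for row in rows]
        out.append(convex_combine(queries, softmax(logits)))
    return out


# Toy data: three base queries of dimension 2 (illustrative values only).
base = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A random convex combination: softmax over random logits.
random.seed(0)
random_weights = softmax([random.gauss(0.0, 1.0) for _ in base])
random_query = convex_combine(base, random_weights)

# Image-dependent coefficients from a stand-in global image feature.
feature = [0.5, -1.0]
proj = [[[random.gauss(0.0, 1.0) for _ in feature] for _ in base] for _ in range(2)]
dynamic = modulated_queries(base, feature, proj)
```

Because every output is a convex combination, each coordinate of a modulated query stays within the range spanned by the base queries in that coordinate. In a trained model the projection would be learned jointly with the base queries; here it is random purely to keep the sketch self-contained.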