Query-based object detectors directly decode image features into object instances with a set of learnable queries. These query vectors are progressively refined into stable, meaningful representations through a sequence of decoder layers, and are then used to predict object locations and categories with simple FFN heads. In this paper, we present a new query-based object detector (DEQDet) built on a deep equilibrium (DEQ) decoder. Our DEQ decoder models query vector refinement as solving for the fixed point of an implicit layer, which is equivalent to applying infinitely many refinement steps. To tailor this to object decoding, we use a two-step unrolled equilibrium equation to explicitly capture the query vector refinement. This lets us incorporate refinement awareness into DEQ training through inexact gradient back-propagation (RAG). In addition, to stabilize the training of DEQDet and improve its generalization ability, we devise a deep supervision scheme on the optimization path of the DEQ with refinement-aware perturbation (RAP). Our experiments demonstrate that DEQDet converges faster, consumes less memory, and achieves better results than its baseline counterpart (AdaMixer). In particular, DEQDet with a ResNet50 backbone and 300 queries achieves $49.5$ mAP and $33.0$ AP$_s$ on the MS COCO benchmark under the $2\times$ training scheme (24 epochs).
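
The fixed-point view of query refinement can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `refine` is a hypothetical toy contractive map standing in for a decoder layer (it is not the AdaMixer decoder), `solve_fixed_point` is plain fixed-point iteration (the abstract does not say which solver DEQDet uses), and the queries are reduced to a short list of scalars. It shows only that repeatedly applying the same refinement step converges to a point $q^*$ with $q^* = f(q^*, x)$, after which further refinement changes nothing.

```python
import math


def refine(q, x, w=0.5):
    """Hypothetical stand-in for one decoder layer.

    Maps queries q and (fixed) image features x to refined queries.
    With |w| < 1 the map is a contraction in q, so a unique fixed point exists.
    """
    return [math.tanh(w * qi + xi) for qi, xi in zip(q, x)]


def solve_fixed_point(x, q0, tol=1e-10, max_iter=1000):
    """Naive fixed-point iteration: repeat q <- refine(q, x) until the update is below tol."""
    q = list(q0)
    for step in range(1, max_iter + 1):
        q_next = refine(q, x)
        residual = max(abs(a - b) for a, b in zip(q_next, q))
        q = q_next
        if residual < tol:
            return q, step
    return q, max_iter


# Toy "image features" and zero-initialised queries (illustrative values only).
x = [0.3, -0.2, 0.1]
q_star, steps = solve_fixed_point(x, q0=[0.0, 0.0, 0.0])

# Applying one more refinement step at the fixed point leaves the queries
# essentially unchanged, which is the "infinite refinement" equivalence.
gap = max(abs(a - b) for a, b in zip(refine(q_star, x), q_star))
```

In an actual DEQ model the solver typically runs without storing intermediate activations, and gradients are obtained implicitly or approximately at the fixed point; the two-step unrolling mentioned in the abstract would correspond to differentiating through two further applications of the refinement step, which this sketch does not implement.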