Object detection in autonomous driving requires precise localization and an understanding of the relational context between co-occurring objects. In highly complex, heterogeneous environments, rare classes, small-scale objects, and frequently appearing objects are difficult for standard object detection frameworks to handle. In this paper, we propose a novel framework called Context-Centric Feature Fusion (CCFF), which uses two attention-based modules. The Local Context Fusion Module (LCFM) uses an RoI-to-RoI self-attention mechanism to resolve spatial interactions, targeting mainly small and partially occluded objects, while the Global Context Attention Module (GCAM) encodes object co-occurrence priors by pooling top-K RoI features into a global context attention token, avoiding the computational overhead of pixel-level global pooling. Fusing these local and object-centric global features yields contextualized embeddings that improve classification and the detection of co-occurring objects. We evaluate our method on two datasets, Cityscapes and BDD100K, where it demonstrates significant improvements in relational consistency, achieving a Category-level Consistency Strategy (CCS) of 0.973 and 0.969, respectively. Furthermore, our approach produces substantial gains in small object detection (AP_S: 14.1%) and successfully recovers rare classes such as "Train" that are typically lost in large distributions. Our efficiency report shows that the framework processes images in real time with a 0.2 FPS overhead. The code is available at https://github.com/BinayKSingh/CCFF.
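To make the two-module idea concrete, the following is a minimal, dependency-free sketch of RoI-to-RoI self-attention combined with a pooled top-K global token. It is an illustration under stated assumptions, not the released CCFF implementation: the function names (`local_context`, `global_token`, `fuse`), the identity query/key/value projections, the use of detection scores to select the top-K RoIs, mean pooling, and fusion by concatenation are all hypothetical choices that the abstract does not specify.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def local_context(rois):
    """RoI-to-RoI scaled dot-product self-attention.

    Assumption: identity Q/K/V projections, single head, for brevity.
    """
    d = len(rois[0])
    out = []
    for q in rois:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in rois])
        out.append([sum(w * v[j] for w, v in zip(weights, rois)) for j in range(d)])
    return out


def global_token(rois, scores, k):
    """Mean-pool the k highest-scoring RoI features into one token.

    Assumption: RoIs are ranked by a per-RoI score and mean-pooled.
    """
    top = sorted(range(len(rois)), key=lambda i: scores[i], reverse=True)[:k]
    d = len(rois[0])
    return [sum(rois[i][j] for i in top) / len(top) for j in range(d)]


def fuse(rois, scores, k):
    """Concatenate each locally attended feature with the shared global token.

    Assumption: fusion by concatenation.
    """
    local = local_context(rois)
    token = global_token(rois, scores, k)
    return [feat + token for feat in local]


# Toy input: three RoIs with 2-dimensional features and hypothetical scores.
rois = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.9, 0.2, 0.7]
fused = fuse(rois, scores, k=2)
```

In this sketch the global token is built from only K RoI vectors, so its cost grows with the number of selected RoIs rather than with the number of pixels in the feature map, which is the kind of saving over pixel-level global pooling that the abstract points to.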