Learning-based object detection has made great progress over the last decade. Two-stage detectors often achieve higher detection accuracy than one-stage detectors because they use region of interest (RoI) feature extractors, which extract transformation-invariant RoI features for different RoI proposals and thereby make bounding-box refinement and object-category prediction more robust and accurate. However, previous RoI feature extractors can only extract invariant features under a limited set of transformations. In this paper, we propose a novel RoI feature extractor, termed Semantic RoI Align (SRA), which is capable of extracting invariant RoI features under a variety of transformations for two-stage detectors. Specifically, we propose a semantic attention module that adaptively determines different sampling areas by leveraging the global and local semantic relationships within the RoI. We also propose a Dynamic Feature Sampler, which dynamically samples features based on the RoI aspect ratio to improve the efficiency of SRA, and a new position embedding, i.e., Area Embedding, which provides more accurate position information for SRA through an improved sampling-area representation. Experiments show that our model significantly outperforms baseline models while adding only slight computational overhead. In addition, it shows excellent generalization ability and can be used to improve performance with various state-of-the-art backbones and detection methods.
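The abstract states only that the Dynamic Feature Sampler samples features based on the RoI aspect ratio; it does not give the procedure. As a rough illustration of what aspect-ratio-aware sampling could look like, the sketch below splits a fixed sample budget between the two axes in proportion to the RoI's width and height. The function name `dynamic_sample_grid`, the `budget` parameter, and the square-root rule are all assumptions made for this example, not details taken from the paper.

```python
import math


def dynamic_sample_grid(roi, budget=64):
    """Pick a sampling grid for an RoI under a fixed sample budget.

    Hypothetical helper (not the paper's implementation): it chooses
    rows and cols so that cols / rows is close to the RoI aspect ratio
    and rows * cols does not exceed `budget`, then returns the centre
    of every grid cell as a sample point.
    """
    x1, y1, x2, y2 = roi
    width, height = x2 - x1, y2 - y1
    if width <= 0 or height <= 0:
        raise ValueError("RoI must have positive width and height")

    aspect = width / height
    # Solve cols / rows ~= aspect and rows * cols ~= budget for cols.
    cols = max(1, min(budget, round(math.sqrt(budget * aspect))))
    rows = max(1, budget // cols)

    step_x, step_y = width / cols, height / rows
    points = [
        (x1 + (j + 0.5) * step_x, y1 + (i + 0.5) * step_y)
        for i in range(rows)
        for j in range(cols)
    ]
    return rows, cols, points


if __name__ == "__main__":
    # A wide RoI gets more columns than rows; a tall one the reverse.
    for roi in [(0, 0, 40, 10), (0, 0, 10, 10), (0, 0, 10, 40)]:
        rows, cols, points = dynamic_sample_grid(roi)
        print(roi, "->", rows, "x", cols, "grid,", len(points), "samples")
```

Under this assumption, the number of samples stays roughly constant across RoIs while their spacing stays roughly equal along both axes, which is one plausible way an aspect-ratio-aware sampler could avoid oversampling the short side of an elongated RoI.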