Crowd counting models face two main challenges in highly congested areas: weak localization ability and difficulty differentiating foreground from background, both of which lead to inaccurate estimates. The reason is that objects in highly congested areas are usually small, and the high-level features extracted by convolutional neural networks are not discriminative enough to represent small objects. To address these problems, we propose a discriminative feature learning framework for crowd counting, composed of a masked feature prediction module (MPM) and a supervised pixel-level contrastive learning module (CLM). The MPM randomly masks feature vectors in the feature map and then reconstructs them, allowing the model to learn what is present in the masked regions and improving its ability to localize objects in high-density regions. The CLM pulls targets close to each other and pushes them away from the background in the feature space, enabling the model to discriminate foreground objects from background. Additionally, the proposed modules can be beneficial in various computer vision tasks, such as crowd counting and object detection, where dense scenes or cluttered environments make accurate localization difficult. Both modules are plug-and-play: incorporating them into existing models can potentially boost performance in these scenarios.
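The two training signals described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the zero-vector mask token, the mean-squared-error reconstruction loss, the InfoNCE-style contrastive loss with cosine similarity, the temperature value, and all function names (`mask_features`, `reconstruction_loss`, `pixel_contrastive_loss`) are assumptions chosen for illustration, since the abstract does not specify them. A feature map is represented here as a flat list of per-location feature vectors.

```python
import math
import random


def mask_features(features, mask_ratio, rng):
    """Replace a random subset of feature vectors with a zero 'mask token'.

    features: list of equal-length float lists, one per spatial location.
    Returns (masked_features, masked_indices). The zero token is an assumption.
    """
    n = len(features)
    num_masked = max(1, int(round(mask_ratio * n)))
    masked_indices = sorted(rng.sample(range(n), num_masked))
    masked_set = set(masked_indices)
    masked = [
        [0.0] * len(vec) if i in masked_set else list(vec)
        for i, vec in enumerate(features)
    ]
    return masked, masked_indices


def reconstruction_loss(predicted, target, masked_indices):
    """Mean squared error computed only over the masked locations (assumed loss)."""
    total, count = 0.0, 0
    for i in masked_indices:
        for p, t in zip(predicted[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


def _cosine(u, v):
    """Cosine similarity between two vectors, with a small epsilon for stability."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def pixel_contrastive_loss(features, is_foreground, temperature=0.1):
    """Supervised pixel-level contrastive loss (InfoNCE-style, assumed form).

    Each foreground location is an anchor; other foreground locations are
    positives and all background locations are negatives.
    """
    fg = [i for i, flag in enumerate(is_foreground) if flag]
    bg = [i for i, flag in enumerate(is_foreground) if not flag]
    if len(fg) < 2 or not bg:
        return 0.0  # no valid positive pair or no negatives
    losses = []
    for a in fg:
        neg = sum(
            math.exp(_cosine(features[a], features[j]) / temperature) for j in bg
        )
        for p in fg:
            if p == a:
                continue
            pos = math.exp(_cosine(features[a], features[p]) / temperature)
            losses.append(-math.log(pos / (pos + neg)))
    return sum(losses) / len(losses)


# Toy usage: 4 locations with 2-dimensional features.
rng = random.Random(0)
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
masked, idx = mask_features(feats, mask_ratio=0.5, rng=rng)
# In a real model, a predictor network would map `masked` to a reconstruction;
# here the masked input itself stands in for an untrained prediction.
mpm_loss = reconstruction_loss(masked, feats, idx)
clm_loss = pixel_contrastive_loss(feats, [True, True, False, False])
```

In this sketch, the reconstruction loss is evaluated only at masked locations, so the predictor must infer their content from the surrounding unmasked features, while the contrastive loss shrinks as foreground features cluster together and move away from background features.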