Crowd counting remains challenging in variable-density scenes due to scale variations, occlusions, and the high computational cost of existing models. To address these issues, we propose RepSFNet (Reparameterized Single Fusion Network), a lightweight architecture designed for accurate and real-time crowd estimation. RepSFNet leverages a RepLK-ViT backbone with large reparameterized kernels for efficient multi-scale feature extraction. It further integrates a Feature Fusion module combining Atrous Spatial Pyramid Pooling (ASPP) and Context-Aware Network (CAN) to achieve robust, density-adaptive context modeling. A Concatenate Fusion module is employed to preserve spatial resolution and generate high-quality density maps. By avoiding attention mechanisms and multi-branch designs, RepSFNet significantly reduces parameters and computational complexity. The training objective combines Mean Squared Error and Optimal Transport loss to improve both count accuracy and spatial distribution alignment. Experiments conducted on ShanghaiTech, NWPU, and UCF-QNRF datasets demonstrate that RepSFNet achieves competitive accuracy while reducing inference latency by up to 34 percent compared to recent state-of-the-art methods, making it suitable for real-time and low-power edge computing applications.
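The structural reparameterization the abstract refers to can be illustrated with a minimal single-channel numpy sketch: at training time a large kernel runs in parallel with a small one, and at inference the small kernel is folded into the large one by zero-padding and addition, so only a single convolution remains. The 7×7/3×3 sizes and the plain valid-mode convolution here are illustrative assumptions, not RepSFNet's actual configuration.

```python
import numpy as np

def conv2d(x, k):
    """Valid-mode 2D cross-correlation, single channel (illustrative)."""
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def reparameterize(large_k, small_k):
    """Fold a parallel small kernel into the large one: zero-pad the
    small kernel to the large kernel's size (centers aligned) and add."""
    pad = (large_k.shape[0] - small_k.shape[0]) // 2
    return large_k + np.pad(small_k, pad)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
K = rng.standard_normal((7, 7))   # large kernel branch
k = rng.standard_normal((3, 3))   # parallel small kernel branch

# Training-time two-branch output; the small branch's input is cropped
# so both branch outputs cover the same spatial region.
p = (7 - 3) // 2
y_train = conv2d(x, K) + conv2d(x[p:-p, p:-p], k)

# Inference-time single merged kernel gives the identical output.
y_merged = conv2d(x, reparameterize(K, k))
```

By linearity of convolution, `y_train` and `y_merged` are equal, which is what lets the multi-branch training graph collapse to one kernel at inference with no accuracy change.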
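The combined training objective can be sketched as a pixel-wise MSE term plus an entropy-regularized OT cost computed with Sinkhorn iterations. This is a generic sketch, not the paper's implementation: `eps`, `iters`, and the weight `lam` are illustrative assumptions, and real density maps would use a 2D ground-cost over pixel coordinates.

```python
import numpy as np

def sinkhorn_ot(a, b, C, eps=0.1, iters=200):
    """Entropy-regularized OT cost between histograms a and b with
    ground-cost matrix C, via Sinkhorn fixed-point iterations."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]   # transport plan
    return float(np.sum(P * C))

def combined_loss(pred, target, C, lam=0.1):
    """MSE on raw values plus a weighted OT term on the normalized
    distributions (lam is an illustrative weighting, not the paper's)."""
    mse = float(np.mean((pred - target) ** 2))
    ot = sinkhorn_ot(pred / pred.sum(), target / target.sum(), C)
    return mse + lam * ot

# Toy 1D example: two unit masses, with cost = |position difference|.
pos = np.array([0.0, 1.0])
a = np.array([0.5, 0.5])
C_same = np.abs(pos[:, None] - pos[None, :])
C_shift = np.abs(pos[:, None] - (pos + 2.0)[None, :])

ot_same = sinkhorn_ot(a, a, C_same)    # near zero: distributions match
ot_shift = sinkhorn_ot(a, a, C_shift)  # ~2: all mass moved a distance of 2
```

The OT term is what penalizes spatially misplaced density: MSE alone treats a count predicted in the wrong location the same as a missing count, while the transport cost grows with how far predicted mass sits from the annotated positions.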