Crowd counting has direct real-world applications, which makes both computational efficiency and performance crucial. However, most previous methods rely on a heavy backbone and a complex downstream architecture, which restricts deployment. To address this challenge and enhance the versatility of crowd-counting models, we introduce two lightweight models. They share the same downstream architecture but use two distinct backbones: MobileNet and MobileViT. We leverage Adjacent Feature Fusion to extract features at diverse scales from a Pre-Trained Model (PTM) and then combine them. This approach lets our models achieve improved performance while maintaining a compact and efficient design. Compared with previous state-of-the-art (SOTA) methods on the ShanghaiTech-A, ShanghaiTech-B, and UCF-CC-50 datasets, our models achieve comparable results while being the most computationally efficient. Finally, we present a comparative study, an extensive ablation study, and pruning experiments to show the effectiveness of our models.
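As a rough illustration of fusing features from neighbouring scales, the sketch below takes a small pyramid of single-channel feature maps and merges each adjacent pair. The abstract does not specify the fusion operator, so this assumes one plausible reading: the coarser map is upsampled by nearest-neighbour repetition and added element-wise to the finer one. The names `upsample_nearest` and `fuse_adjacent` are hypothetical, and the toy maps stand in for real backbone outputs.

```python
from typing import List

# Single-channel H x W feature map; real backbone features would be multi-channel tensors.
FeatureMap = List[List[float]]


def upsample_nearest(fmap: FeatureMap, factor: int) -> FeatureMap:
    """Nearest-neighbour upsampling: repeat each value `factor` times along both axes."""
    out: FeatureMap = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def fuse_adjacent(features: List[FeatureMap]) -> List[FeatureMap]:
    """Fuse every pair of neighbouring scales (fine, coarse) into one map.

    Assumption (not from the paper): fusion means upsampling the coarser map
    to the finer resolution and adding the two element-wise.
    """
    fused: List[FeatureMap] = []
    for fine, coarse in zip(features, features[1:]):
        factor = len(fine) // len(coarse)  # assumes an integer scale ratio
        up = upsample_nearest(coarse, factor)
        fused.append(
            [[a + b for a, b in zip(r_fine, r_up)] for r_fine, r_up in zip(fine, up)]
        )
    return fused


# Toy three-level pyramid: 4x4, 2x2, and 1x1 maps with constant values.
f1 = [[1.0] * 4 for _ in range(4)]
f2 = [[2.0] * 2 for _ in range(2)]
f3 = [[3.0]]
fused = fuse_adjacent([f1, f2, f3])
```

Three input scales yield two fused maps, one per adjacent pair, each at the resolution of the finer input. A learned variant would typically replace the plain addition with convolutional layers; that choice is outside what the abstract states.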