Diffusion-based Data Augmentation for Object Counting Problems

Crowd counting is an important problem in computer vision because of its wide range of applications in image understanding. It is currently addressed mainly with deep learning approaches such as Convolutional Neural Networks (CNNs) and Transformers. However, deep networks are data-driven and prone to overfitting, especially when labeled crowd data are limited. To overcome this limitation, we design a pipeline that uses a diffusion model to generate extensive training data. We are the first to use a diffusion model to generate images conditioned on a location dot map (a binary dot map that specifies the locations of human heads), and the first to use such diverse synthetic data to augment crowd counting models. Our proposed smoothed density map input for ControlNet significantly improves its ability to generate crowds in the correct locations. In addition, our proposed counting loss for the diffusion model effectively reduces the discrepancy between the location dot map and the generated crowd images. Our guidance sampling further steers the diffusion process toward regions where the generated crowd images align most accurately with the location dot map. Together, these components enhance ControlNet's ability to generate specified objects from a location dot map, which can be used for data augmentation in a variety of counting problems. The framework is versatile and can be readily adapted to other counting tasks. Extensive experiments show that it improves counting performance on the ShanghaiTech, NWPU-Crowd, UCF-QNRF, and TRANCOS datasets.
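To make the relationship between a location dot map and a smoothed density map concrete, the following is a minimal sketch and not the implementation used in this work. It assumes an isotropic Gaussian kernel with a fixed, arbitrarily chosen `sigma`; the abstract does not specify the kernel, and the function names `smooth_dot_map` and `count_from_density` are hypothetical. Each dot is spread over its neighborhood and renormalized, so the density map still sums to the number of annotated heads, which is the standard property of density maps in crowd counting.

```python
import math


def smooth_dot_map(dot_map, sigma=2.0):
    """Spread each dot of a binary dot map into a Gaussian blob.

    Assumption: isotropic Gaussian with fixed sigma (illustrative only).
    Each blob is truncated to the image and renormalized to sum to 1,
    so every annotated head contributes exactly one unit of density.
    """
    height, width = len(dot_map), len(dot_map[0])
    radius = max(1, int(3 * sigma))  # truncate the kernel at about 3 sigma
    density = [[0.0] * width for _ in range(height)]

    for r in range(height):
        for c in range(width):
            if not dot_map[r][c]:
                continue
            # Unnormalized Gaussian weights inside the image bounds.
            window = {}
            for rr in range(max(0, r - radius), min(height, r + radius + 1)):
                for cc in range(max(0, c - radius), min(width, c + radius + 1)):
                    dist_sq = (rr - r) ** 2 + (cc - c) ** 2
                    window[(rr, cc)] = math.exp(-dist_sq / (2.0 * sigma ** 2))
            total = sum(window.values())
            # Normalize so this head adds exactly 1 to the map's total mass.
            for (rr, cc), weight in window.items():
                density[rr][cc] += weight / total
    return density


def count_from_density(density):
    """Estimate the object count as the sum of the density map."""
    return sum(sum(row) for row in density)


# Example: three heads on a 16x16 grid, one of them in a corner.
dots = [[0] * 16 for _ in range(16)]
for row, col in [(2, 3), (8, 8), (15, 0)]:
    dots[row][col] = 1

density = smooth_dot_map(dots, sigma=1.5)
print(round(count_from_density(density), 6))
```

Because the map stays count-preserving, comparing its sum with a count estimated from a generated image is one plausible way to measure the kind of discrepancy a counting loss targets; the exact formulation used here is not given in the abstract.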
