Crowd counting is a critical task in computer vision with several important applications. However, existing counting methods rely on labor-intensive density map annotations, which require manually localizing every individual pedestrian. Recent work has tried to ease this burden through weakly or semi-supervised learning, but these approaches do not significantly reduce the workload. We propose a novel approach that eliminates the annotation burden by leveraging latent diffusion models to generate synthetic data. These models, however, struggle to reliably understand object quantities, which leads to noisy annotations when they are prompted to produce images with a specific number of objects. To address this, we use latent diffusion models to create two types of synthetic data: ranked image pairs, produced by removing pedestrians from real images, which carry a weak but reliable object quantity signal; and synthetic images generated with a predetermined number of objects, which offer a strong but noisy counting signal. Our method pre-trains on the ranked image pairs and then fits a linear layer to the noisy synthetic images on top of the resulting crowd quantity features. We report state-of-the-art results for unsupervised crowd counting.
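The two-stage recipe described in the abstract (ranking-based pre-training, then a linear fit on noisy counts) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify the form of the ranking loss, the feature dimensionality, or how the linear layer is fitted. The sketch uses a hinge-style pairwise ranking loss, treats the pre-trained crowd quantity feature as a single scalar score per image, and fits the linear layer by ordinary least squares. The function names `ranking_hinge_loss` and `fit_linear_head` are hypothetical.

```python
from statistics import fmean


def ranking_hinge_loss(scores_original, scores_removed, margin=0.0):
    """Pairwise hinge loss over ranked image pairs (assumed loss form).

    Each pair holds the score of a real image and the score of the same
    image after pedestrians were removed. The original should score
    higher, so a pair is penalized only when that ordering is violated.
    """
    losses = [
        max(0.0, removed - original + margin)
        for original, removed in zip(scores_original, scores_removed)
    ]
    return fmean(losses)


def fit_linear_head(scores, noisy_counts):
    """Least-squares fit of count ~ a * score + b (assumed fitting method).

    `scores` are scalar crowd quantity features from the pre-trained
    ranker; `noisy_counts` are the object counts requested in the
    prompts used to generate the synthetic images.
    """
    mean_x = fmean(scores)
    mean_y = fmean(noisy_counts)
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, noisy_counts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_count(score, slope, intercept):
    """Map a crowd quantity score to a count estimate."""
    return slope * score + intercept
```

In this simplification the ranking stage only has to order images correctly, which is why the weak but reliable pairs suffice for it, while the noisy counts are used solely to calibrate two parameters (slope and intercept) rather than to train the whole network.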