Multi-modal crowd counting involves estimating crowd density from both visual and thermal/depth images. This task is challenging due to the significant gap between these distinct modalities. In this paper, we propose a novel approach that introduces an auxiliary broker modality and, on that basis, frames the task as a triple-modal learning problem. We devise a fusion-based method to generate this broker modality, leveraging a non-diffusion, lightweight counterpart of modern denoising diffusion-based fusion models. Additionally, we identify and address the ghosting effect caused by direct cross-modal image fusion in multi-modal crowd counting. Through extensive experimental evaluations on popular multi-modal crowd-counting datasets, we demonstrate the effectiveness of our method, which introduces only 4 million additional parameters yet achieves promising results. The code is available at https://github.com/HenryCilence/Broker-Modality-Crowd-Counting.