Why Existing Multimodal Crowd Counting Datasets Can Lead to Unfulfilled Expectations in Real-World Applications

More information leads to better decisions and predictions, right? In line with this hypothesis, several studies have concluded that the simultaneous use of optical and thermal images leads to better predictions in crowd counting. However, the way multimodal models extract enriched features from both modalities is not yet fully understood. Since multimodal data usually increases the complexity, inference time, and memory requirements of a model, it is worth examining the differences between multimodal and monomodal models and the advantages of the former. In this work, we use all available multimodal datasets for crowd counting to investigate these differences. To do so, we designed a monomodal architecture that reflects the current state of research on monomodal crowd counting. In addition, we developed several multimodal architectures using different multimodal learning strategies. The key components of the monomodal architecture are reused in the multimodal architectures, so that the comparison can answer whether multimodal models perform better in crowd counting in general. Surprisingly, no general answer to this question can be derived from the existing datasets. We found that the existing datasets are biased toward thermal images. We determined this by analyzing the relationship between the brightness of the optical images and the crowd count, and by examining the annotations made for each dataset. Since answering this question is important for future real-world applications of crowd counting, this paper establishes criteria that a dataset would have to meet to be suitable for answering it.
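The brightness analysis mentioned above can be illustrated with a minimal sketch. The abstract does not state how brightness is measured or which statistic relates it to the crowd count, so everything here is an assumption: brightness is taken as the mean Rec. 601 luma of the optical image, the relationship is summarized by a Pearson correlation, and the function names and the in-memory image representation (rows of RGB tuples) are hypothetical.

```python
import math

# Assumption: "brightness" is the mean Rec. 601 luma of the optical image.
# The abstract does not specify the measure used by the authors.
LUMA_WEIGHTS = (0.299, 0.587, 0.114)


def mean_brightness(image):
    """Mean luma of an image given as rows of (r, g, b) tuples in 0..255."""
    total = 0.0
    pixels = 0
    for row in image:
        for r, g, b in row:
            total += LUMA_WEIGHTS[0] * r + LUMA_WEIGHTS[1] * g + LUMA_WEIGHTS[2] * b
            pixels += 1
    if pixels == 0:
        raise ValueError("image has no pixels")
    return total / pixels


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def brightness_count_correlation(samples):
    """Correlate optical-image brightness with the annotated crowd count.

    `samples` is an iterable of (optical_image, crowd_count) pairs;
    this pairing is a hypothetical stand-in for a real dataset loader.
    """
    samples = list(samples)
    brightness = [mean_brightness(image) for image, _ in samples]
    counts = [float(count) for _, count in samples]
    return pearson(brightness, counts)
```

Calling `brightness_count_correlation` on the (optical image, count) pairs of one dataset yields a value between -1 and 1. A coefficient far from zero would indicate that scene lighting and crowd size are not independent in that dataset, which is one way a preference for a single modality could arise; a value near zero alone would not rule out other sources of bias, such as the annotation process.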
