Multi-view counting (MVC) methods have shown superiority over single-view counterparts, particularly under heavy occlusion and severe perspective distortion. However, conventional MVC methods rely on hand-crafted heuristic features and require identical camera layouts, which limits their applicability and scalability in real-world scenarios. In this work, we propose a concise 3D MVC framework called \textbf{CountFormer} that elevates multi-view image-level features to a scene-level volume representation and estimates the 3D density map from the volume features. Through a camera encoding strategy, CountFormer embeds camera parameters into the volume query and the image-level features, enabling it to handle camera layouts that differ significantly. Furthermore, we introduce a feature lifting module built on the attention mechanism that transforms image-level features into a 3D volume representation for each camera view. A multi-view volume aggregation module then attentively aggregates the per-view volumes into a comprehensive scene-level volume representation, allowing CountFormer to handle images captured by arbitrary dynamic camera layouts. The proposed method performs favorably against state-of-the-art approaches across several widely used datasets, demonstrating its greater suitability for real-world deployment compared with conventional MVC frameworks.
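To make the aggregation and counting steps concrete, the following is a minimal sketch and not the CountFormer implementation. It assumes each view has already been lifted to a volume with one scalar value per voxel, that a hypothetical attention logit is available for every voxel in every view, and that the fused value can be read directly as a density. In the actual framework the density map is estimated from learned volume features. The function names `aggregate_views` and `count_from_density` are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_views(view_volumes, view_scores):
    """Fuse per-view volumes voxel by voxel with softmax attention weights.

    view_volumes[v][i]: value of voxel i as seen from view v (assumed scalar).
    view_scores[v][i]:  hypothetical attention logit for that voxel and view.
    """
    num_views = len(view_volumes)
    num_voxels = len(view_volumes[0])
    fused = []
    for i in range(num_voxels):
        # Normalize the logits across views so the weights sum to 1 per voxel.
        weights = softmax([view_scores[v][i] for v in range(num_views)])
        fused.append(sum(w * view_volumes[v][i] for v, w in enumerate(weights)))
    return fused


def count_from_density(density):
    """The estimated count is the sum of the density over all voxels."""
    return sum(density)


# Two views of a flattened three-voxel scene, with equal attention logits.
volumes = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
scores = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
fused = aggregate_views(volumes, scores)
print(fused, count_from_density(fused))
```

With equal logits the fusion reduces to a per-voxel mean over views. Because the weights are computed per voxel over whatever views are supplied, the same function accepts any number of views, which loosely mirrors why an attention-based aggregation does not depend on a fixed camera layout.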