Accurate 3D object perception and multi-target multi-camera (MTMC) tracking are fundamental to the digital transformation of industrial infrastructure. However, transitioning "inside-out" autonomous driving models to "outside-in" static camera networks presents significant challenges due to heterogeneous camera placements and extreme occlusion. In this paper, we present an adapted Sparse4D framework specifically optimized for large-scale infrastructure environments. Our system leverages absolute world-coordinate geometric priors and introduces an occlusion-aware ReID embedding module to maintain identity stability across distributed sensor networks. To bridge the Sim2Real domain gap without manual labeling, we employ a generative data augmentation strategy built on the NVIDIA COSMOS framework, creating diverse environmental styles that enhance the model's appearance invariance. On the AI City Challenge 2025 benchmark, our camera-only framework achieves a state-of-the-art HOTA of $45.22$. Furthermore, we address real-time deployment constraints by developing an optimized TensorRT plugin for Multi-Scale Deformable Aggregation (MSDA). Our hardware-accelerated implementation achieves a $2.15\times$ speedup on modern GPU architectures, enabling a single Blackwell-class GPU to support over 64 concurrent camera streams.