Outside-in multi-camera perception is increasingly important in indoor environments, where networks of static cameras must support multi-target tracking under occlusion and heterogeneous viewpoints. We evaluate Sparse4D, a query-based spatiotemporal 3D detection and tracking framework that fuses multi-view features in a shared world frame and propagates sparse object queries via instance memory. We study four deployment axes: reduced input frame rates, post-training quantization (INT8 and FP8), transfer to the WILDTRACK benchmark, and Transformer Engine mixed-precision fine-tuning. To better capture identity stability, we report Average Track Duration (AvgTrackDur), which measures identity persistence in seconds. Sparse4D tolerates moderate frame-rate reductions, but below 2 FPS identity association collapses even when per-frame detections remain reliable. Selective quantization of the backbone and neck offers the best speed-accuracy trade-off, whereas attention-related modules are consistently sensitive to low precision. On WILDTRACK, low-FPS pretraining yields large zero-shot gains over the base checkpoint, while small-scale fine-tuning provides limited additional benefit. Transformer Engine mixed precision reduces latency and improves camera scalability but can destabilize identity propagation, motivating stability-aware validation.