Idling vehicle detection (IVD) uses surveillance video and multichannel audio to localize vehicles in pick-up zones and classify each vehicle in the last video frame as moving, idling, or engine-off. IVD faces three challenges: (i) modality heterogeneity between visual cues and audio patterns; (ii) large variation in bounding-box scale, which requires multi-resolution detection; and (iii) training instability caused by coupled detection heads. The previous end-to-end (E2E) model, which relies on simple CBAM-based bi-modal attention, does not address these issues and often misses vehicles. We propose HAVT-IVD, a heterogeneity-aware network with a visual feature pyramid and decoupled heads. Experiments show that HAVT-IVD improves mAP by 7.66 over the disjoint baseline and by 9.42 over the E2E baseline.
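To make the decoupled-head idea concrete, the following toy sketch separates a classification branch from a box-regression branch so that each has its own parameters. It is not the HAVT-IVD implementation: the abstract says only that the heads are decoupled, so the split into classification and box branches, the class name `DecoupledHead`, the feature size, the linear layers, and the initialization are all assumptions made for illustration.

```python
import random


def linear(weights, bias, x):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_linear(n_out, n_in, rng):
    """Small random weights and zero bias (hypothetical initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class DecoupledHead:
    """Toy head with independent classification and box branches (assumed design)."""

    # The three vehicle states named in the abstract.
    CLASSES = ("moving", "idling", "engine-off")

    def __init__(self, feat_dim, seed=0):
        rng = random.Random(seed)
        # Each branch owns its parameters; nothing is shared after the input feature.
        self.cls_w, self.cls_b = init_linear(len(self.CLASSES), feat_dim, rng)
        self.box_w, self.box_b = init_linear(4, feat_dim, rng)

    def forward(self, feature):
        cls_logits = linear(self.cls_w, self.cls_b, feature)  # one logit per state
        box = linear(self.box_w, self.box_b, feature)         # 4 box parameters
        return cls_logits, box


head = DecoupledHead(feat_dim=8)
feature = [0.5] * 8  # stand-in for a fused audio-visual feature vector
logits_before, box_before = head.forward(feature)

# Perturb only the classification branch, as a classification-loss update would.
head.cls_w[0][0] += 1.0
logits_after, box_after = head.forward(feature)
```

Because the branches share no parameters, changing a classification weight leaves `box_after` identical to `box_before`. In a coupled head, where one set of weights produces both outputs, the same update would also shift the box prediction, which is one plausible reading of the training instability the abstract attributes to coupled heads.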