Idling vehicle detection (IVD) helps monitor and reduce unnecessary idling, and can be integrated into real-time systems to address the resulting pollution and harmful byproducts. The previous approach [13], a non-end-to-end model, requires extra user clicks to specify part of the input, which makes system deployment more error-prone or even infeasible. In contrast, we introduce an end-to-end joint audio-visual IVD task that visually detects vehicles in one of three states: moving, idling, and engine off. Unlike feature co-occurrence tasks such as audio-visual vehicle tracking, our IVD task relies on complementary features: labels cannot be determined from a single modality alone. To this end, we propose AVIVD-Net, a novel network that integrates audio and visual features through a bidirectional attention mechanism. AVIVD-Net streamlines the input process by learning a joint feature space, reducing the deployment complexity of previous methods. Additionally, we introduce the AVIVD dataset, which is seven times larger than previous datasets and offers significantly more annotated samples for studying the IVD problem. Our model achieves performance comparable to prior approaches without requiring manual user input, making it suitable for automated deployment. Furthermore, by evaluating AVIVD-Net on the public feature co-occurrence dataset MAVD [23], we demonstrate its potential for extension to self-driving vehicle video-camera setups.
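To make the idea of bidirectional audio-visual attention concrete, the following is a minimal sketch and not the AVIVD-Net implementation. It assumes plain scaled dot-product attention with no learned projections, equal feature dimensions for both modalities, and fusion by concatenation. The function names `cross_attend` and `bidirectional_fusion` are hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention without learned projections (an assumption).

    Each query row attends over all rows of keys_values and returns a
    weighted average of those rows.
    """
    d = len(queries[0])
    contexts = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values
        ]
        weights = softmax(scores)
        contexts.append(
            [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(d)]
        )
    return contexts


def bidirectional_fusion(audio, visual):
    """Attend in both directions, then concatenate each token with its context.

    audio:  list of audio feature vectors (e.g. one per time frame)
    visual: list of visual feature vectors (e.g. one per detected region)
    """
    audio_ctx = cross_attend(visual, audio)   # visual tokens query the audio
    visual_ctx = cross_attend(audio, visual)  # audio tokens query the visuals
    fused_visual = [v + a for v, a in zip(visual, audio_ctx)]  # list concat
    fused_audio = [a + v for a, v in zip(audio, visual_ctx)]
    return fused_audio, fused_visual


# Toy inputs: three audio frames and two visual regions, feature size 2.
audio = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
visual = [[0.2, 0.8], [0.9, 0.1]]
fused_audio, fused_visual = bidirectional_fusion(audio, visual)
```

In this sketch each visual token ends up carrying an audio summary weighted by similarity, and vice versa, which is one simple way a label that depends on both modalities (for example, a stationary vehicle with a running engine) could be read from a single fused vector. A trained model would normally add learned query, key, and value projections and multiple heads.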