The increasing use of compact UAVs poses significant threats to public safety, while traditional drone detection systems are often bulky and costly. To address these challenges, we propose AV-DTEC, a lightweight self-supervised audio-visual fusion-based anti-UAV system. AV-DTEC is trained via self-supervised learning with labels generated by LiDAR, and it learns audio and visual features simultaneously through a parallel selective state-space model. On top of the learned features, a specially designed plug-and-play primary-auxiliary feature enhancement module integrates visual features into audio features for greater robustness across lighting conditions. To reduce reliance on auxiliary features and to align the modalities, we propose a teacher-student model that adaptively adjusts the weighting of visual features. AV-DTEC demonstrates exceptional accuracy and effectiveness on real-world multi-modality data. The code and trained models are publicly available on GitHub at \url{https://github.com/AmazingDay1/AV-DETC}.