As small unmanned aerial vehicles (UAVs) become increasingly prevalent, concern is growing over their impact on public safety and privacy, underscoring the need for advanced tracking and trajectory estimation solutions. In response, this paper introduces a novel framework that uses an audio array for 3D UAV trajectory estimation. Our approach incorporates a self-supervised learning model: audio data are first converted into mel-spectrograms, which an encoder analyzes to extract salient temporal and spectral features. In parallel, UAV trajectories are estimated from LiDAR point clouds using unsupervised methods. These LiDAR-based estimates serve as pseudo labels, enabling the Audio Perception Network to be trained without labeled data. In this architecture, the LiDAR-based system acts as the Teacher Network, guiding the Audio Perception Network, which serves as the Student Network. Once trained, the model predicts 3D trajectories from audio signals alone, requiring neither LiDAR data nor external ground truth at deployment. To further improve precision, we apply Gaussian Process modeling for refined spatiotemporal tracking. Our method achieves state-of-the-art performance on the MMAUD dataset, establishing a new benchmark for self-supervised trajectory estimation without reliance on ground-truth annotations.
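The teacher-student scheme described above can be sketched minimally: a "teacher" (standing in for the LiDAR pipeline) supplies noisy pseudo labels, and a "student" (here reduced to a linear head over audio features, purely for illustration) is trained against those pseudo labels with no ground-truth supervision. All dimensions, noise levels, and the linear model are hypothetical choices, not the paper's actual network.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical setup: 200 audio frames, 16-dim spectral features, 3D positions.
X = rng.standard_normal((200, 16))            # audio features per frame (assumed)
W_true = rng.standard_normal((16, 3))         # underlying audio->position map (assumed)
pos = X @ W_true                              # true 3D trajectory (never seen in training)
pseudo = pos + 0.1 * rng.standard_normal(pos.shape)  # teacher's noisy pseudo labels

# Student: linear head fit by gradient descent on pseudo labels only.
W = np.zeros((16, 3))
lr = 0.01
for _ in range(500):
    grad = X.T @ (X @ W - pseudo) / len(X)    # MSE gradient w.r.t. W
    W -= lr * grad

pred = X @ W
mse = np.mean((pred - pos) ** 2)              # error vs. the held-out true trajectory
print(mse)
```

The point of the sketch is that the student never touches ground truth: supervision flows entirely through the teacher's pseudo labels, yet the learned predictor still tracks the true trajectory closely when the pseudo-label noise is moderate.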
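The Gaussian Process refinement step can likewise be illustrated with a minimal numpy sketch: GP regression with an RBF kernel smooths a noisy 1D trajectory coordinate over time. The kernel, length scale, and noise level below are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np

def rbf(a, b, length=0.5, var=1.0):
    """Squared-exponential kernel between two 1D time vectors."""
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / length) ** 2)

def gp_smooth(t, y, t_query, noise=0.05):
    """GP posterior mean at t_query given noisy observations (t, y)."""
    K = rbf(t, t) + noise**2 * np.eye(len(t))
    Ks = rbf(t_query, t)
    return Ks @ np.linalg.solve(K, y)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 50)
true = np.sin(t)                              # hypothetical altitude profile
y = true + 0.1 * rng.standard_normal(len(t))  # noisy per-frame estimates
smooth = gp_smooth(t, y, t)                   # GP-refined trajectory
```

Because nearby timestamps are strongly correlated under the RBF kernel, the posterior mean averages out frame-level estimation noise while preserving the smooth motion of the vehicle, which is exactly the role spatiotemporal GP modeling plays in the refinement stage.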