We introduce SOAR, a novel Self-supervised pretraining algorithm for aerial footage captured by Unmanned Aerial Vehicles (UAVs). We incorporate human object knowledge throughout the pretraining process to enhance UAV video pretraining efficiency and downstream action recognition performance. This is in contrast to prior works that primarily incorporate object information during the fine-tuning stage. Specifically, we first propose a novel object-aware masking strategy designed to retain the visibility of certain patches related to objects throughout the pretraining phase. Second, we introduce an object-aware loss function that utilizes object information to adjust the reconstruction loss, preventing bias towards less informative background patches. In practice, SOAR with a vanilla ViT backbone outperforms the best prior UAV action recognition models, recording 9.7% and 21.4% boosts in top-1 accuracy on the NEC-Drone and UAV-Human datasets, respectively, while delivering an inference speed of 18.7 ms per video, 2x to 5x faster than those models. Additionally, SOAR obtains accuracy comparable to prior self-supervised learning (SSL) methods while requiring 87.5% less pretraining time and 25% less memory usage.
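The abstract does not give the exact formulation of the two components, but their mechanics can be sketched. Below is a minimal, hypothetical illustration of (a) a masking sampler that biases masking toward background patches so object patches tend to stay visible, and (b) a reconstruction loss that up-weights masked object patches. All names (`object_scores`, `alpha`, the noise-based ranking) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def object_aware_masking(object_scores, mask_ratio=0.75, rng=None):
    """Sample which patches to mask, biased toward background.

    object_scores: (N,) array in [0, 1], e.g. per-patch scores from a
        human detector (assumed input, not specified in the abstract).
    Returns a boolean array of shape (N,) where True = masked (hidden).
    """
    rng = np.random.default_rng(rng)
    n = len(object_scores)
    n_mask = int(round(mask_ratio * n))
    # Random noise keeps masking stochastic; subtracting object_scores
    # lowers the masking priority of object patches, so they tend to
    # remain visible to the encoder.
    priority = rng.random(n) - object_scores
    masked_idx = np.argsort(priority)[-n_mask:]  # highest priority -> masked
    mask = np.zeros(n, dtype=bool)
    mask[masked_idx] = True
    return mask

def object_aware_loss(pred, target, mask, object_scores, alpha=1.0):
    """Per-patch MSE over masked patches, up-weighted on object patches
    so the loss is not dominated by easy background reconstruction."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)   # (N,)
    weights = (1.0 + alpha * object_scores) * mask      # masked patches only
    return (per_patch * weights).sum() / weights.sum()
```

In this sketch, a patch with `object_scores` near 1 is almost always kept visible, and when such a patch is masked its reconstruction error counts `1 + alpha` times as much as a background patch's.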