We introduce FALCON, a unified self-supervised video pretraining approach for UAV action recognition from raw RGB aerial footage, requiring no additional preprocessing at inference. UAV videos exhibit severe spatial imbalance: large, cluttered backgrounds dominate the field of view, causing reconstruction-based pretraining to waste capacity on uninformative regions and under-learn action-relevant human/object cues. FALCON addresses this by integrating object-aware masked autoencoding with object-centric dual-horizon future reconstruction. Using detections only during pretraining, we construct objectness priors that (i) enforce balanced token visibility during masking and (ii) concentrate reconstruction supervision on action-relevant regions, preventing learning from being dominated by background appearance. To promote temporal dynamics learning, we further reconstruct short- and long-horizon future content within an object-centric supervision region, injecting anticipatory temporal supervision that is robust to noisy aerial context. Across UAV benchmarks, FALCON improves top-1 accuracy by 2.9\% on NEC-Drone and 5.8\% on UAV-Human with a ViT-B backbone, while achieving 2$\times$--5$\times$ faster inference than supervised approaches that rely on heavy test-time augmentation.
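The two mechanisms named in the abstract, balanced token visibility under masking and object-concentrated reconstruction loss, can be sketched in a few lines. The function below is a hypothetical illustration, not the paper's implementation: it assumes a per-token objectness prior in $[0,1]$ (e.g. detection boxes rasterized onto the token grid), reserves a fixed share of the visible tokens for object regions, and up-weights masked object tokens in the reconstruction loss.

```python
import numpy as np

def object_aware_mask(objectness, mask_ratio=0.75, obj_visible_frac=0.5, seed=0):
    """Hypothetical sketch of object-aware MAE masking.

    objectness : (N,) array in [0, 1], per-token objectness prior
                 (assumed to come from detections used only at pretraining).
    Returns a boolean mask (True = masked) and per-token loss weights
    that concentrate reconstruction supervision on object tokens.
    """
    rng = np.random.default_rng(seed)
    n = objectness.shape[0]
    n_visible = int(round(n * (1 - mask_ratio)))
    obj_idx = np.flatnonzero(objectness >= 0.5)
    bg_idx = np.flatnonzero(objectness < 0.5)

    # Balanced visibility: reserve a fixed share of visible tokens
    # for object regions so background cannot dominate the encoder input.
    n_obj_vis = min(len(obj_idx), int(round(n_visible * obj_visible_frac)))
    n_bg_vis = min(len(bg_idx), n_visible - n_obj_vis)
    visible = np.concatenate([
        rng.choice(obj_idx, size=n_obj_vis, replace=False),
        rng.choice(bg_idx, size=n_bg_vis, replace=False),
    ])
    mask = np.ones(n, dtype=bool)
    mask[visible] = False

    # Object-centric supervision: masked object tokens get up to 2x the
    # weight of masked background tokens; visible tokens get no loss.
    loss_weight = np.where(mask, 1.0 + objectness, 0.0)
    loss_weight /= loss_weight.sum()
    return mask, loss_weight
```

The same `loss_weight` map could serve as the supervision region for the short- and long-horizon future-reconstruction targets, applied to future-frame tokens instead of the current clip.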