Embodied visual tracking is crucial for Unmanned Aerial Vehicles (UAVs) executing complex real-world tasks. In dynamic urban scenarios with complex semantic requirements, Vision-Language-Action (VLA) models are promising because of their cross-modal fusion and continuous action generation capabilities. To benchmark multimodal tracking in such environments, we construct a dedicated evaluation benchmark and a large-scale dataset comprising over 890K frames, 176 tasks, and 85 diverse objects. To address temporal feature redundancy and the lack of spatial geometric priors in existing VLA models, we further propose an improved VLA tracking model, UAV-Track VLA. Built upon the $\pi_{0.5}$ architecture, our model introduces a temporal compression network to efficiently capture inter-frame dynamics. In addition, a parallel dual-branch decoder, comprising a spatial-aware auxiliary grounding head and a flow matching action expert, decouples cross-modal features and generates fine-grained continuous actions. Systematic experiments in the CARLA simulator validate the superior end-to-end performance of our method. Notably, in challenging long-distance pedestrian tracking tasks, UAV-Track VLA achieves a 61.76\% success rate and 269.65 average tracking frames, significantly outperforming existing baselines. It also demonstrates robust zero-shot generalization in unseen environments and reduces single-step inference latency by 33.4\% (to 0.0571s) compared to the original $\pi_{0.5}$, enabling efficient real-time UAV control. Data samples and demonstration videos are available at: https://github.com/Hub-Tian/UAV-Track\_VLA.
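
The latency figures quoted above can be unpacked with a short calculation. The following is a minimal sketch, not part of the released code: it assumes the 33.4\% reduction is measured relative to the original baseline latency, and that a control loop issues one inference per step with no other overhead. The function names are hypothetical and chosen only for illustration.

```python
# Figures quoted in the abstract.
REDUCED_LATENCY_S = 0.0571   # single-step inference latency of UAV-Track VLA
RELATIVE_REDUCTION = 0.334   # 33.4% lower than the original baseline


def implied_baseline_latency(reduced_s: float, reduction: float) -> float:
    """Baseline latency, ASSUMING reduction = (baseline - reduced) / baseline."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must lie in [0, 1)")
    return reduced_s / (1.0 - reduction)


def max_control_rate_hz(latency_s: float) -> float:
    """Upper bound on the control rate, ASSUMING one blocking inference per step."""
    if latency_s <= 0.0:
        raise ValueError("latency must be positive")
    return 1.0 / latency_s


baseline_latency_s = implied_baseline_latency(REDUCED_LATENCY_S, RELATIVE_REDUCTION)
control_rate_hz = max_control_rate_hz(REDUCED_LATENCY_S)

print(f"implied baseline latency: {baseline_latency_s:.4f} s")
print(f"upper bound on control rate: {control_rate_hz:.1f} Hz")
```

Under these assumptions the baseline latency works out to roughly 0.086 s per step, and the reported 0.0571 s corresponds to an upper bound of about 17.5 control updates per second. Sensor capture, preprocessing, and actuation delays are ignored here, so a real control loop would run somewhat slower.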