Embodied visual tracking is crucial for Unmanned Aerial Vehicles (UAVs) executing complex real-world tasks. In dynamic urban scenarios with complex semantic requirements, Vision-Language-Action (VLA) models are promising because they fuse modalities and generate continuous actions. To benchmark multimodal tracking in such environments, we construct a dedicated evaluation benchmark and a large-scale dataset comprising over 890K frames, 176 tasks, and 85 diverse objects. To address temporal feature redundancy and the lack of spatial geometric priors in existing VLA models, we further propose an improved VLA tracking model, UAV-Track VLA. Built upon the $\pi_{0.5}$ architecture, our model introduces a temporal compression net to efficiently capture inter-frame dynamics. In addition, a parallel dual-branch decoder, comprising a spatial-aware auxiliary grounding head and a flow matching action expert, decouples cross-modal features and generates fine-grained continuous actions. Systematic experiments in the CARLA simulator validate the end-to-end performance of our method. Notably, in challenging long-distance pedestrian tracking tasks, UAV-Track VLA achieves a 61.76\% success rate and 269.65 average tracking frames, significantly outperforming existing baselines. It also demonstrates robust zero-shot generalization in unseen environments and reduces single-step inference latency by 33.4\% (to 0.0571 s) compared to the original $\pi_{0.5}$, enabling efficient real-time UAV control. Data samples and demonstration videos are available at: https://github.com/Hub-Tian/UAV-Track_VLA.