AIR-VLA+: Decoupling Movement and Manipulation via Cascaded Dual-Action Decoders with Asymmetric MoE for Aerial Robots

Aerial manipulation systems have long suffered from representation coupling in end-to-end control, because platform-level Unmanned Aerial Vehicle (UAV) movement and end-effector-level arm manipulation differ substantially in action scale, dynamics, and control objectives. In this paper, we propose AIR-VLA+, a flow-matching action generation architecture designed specifically for aerial manipulation, featuring cascaded dual-action decoders and an asymmetric feature-level Mixture of Experts (MoE). We cascade a manipulation decoder and a movement decoder so that the UAV observes the manipulator's intent in one direction only during movement, which coordinates the workflow while isolating arm manipulation stability from the backpropagation of UAV movement information. Because UAV movement depends heavily on high-level semantics and drives task-state transitions in aerial manipulation, we design an input feature enhancement module for the UAV movement decoder. This module introduces an implicit visual grasp projector that perceives the interaction state between the gripper and the object, and it injects compressed global semantic features. Within the UAV movement decoder, we deploy an implicit MoE architecture in which different movement experts spontaneously specialize in different task stages during training. Dense soft blending on the feature manifold gives UAV movement stronger task-stage adaptability. Experiments on the standardized AIR-VLA benchmark show that our method outperforms all baselines, with an overall average score of 48.0. The overall task completion score improves by 80.2\% over the single-head $π_{0.5}$ policy, effectively mitigating conflicts in the coordinated control of the heterogeneous subsystems of composite robots.
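
The dense soft blending mentioned above can be illustrated with a minimal sketch. It assumes a softmax gate and linear experts operating on a plain feature vector; the names `dense_soft_moe`, `linear`, and `softmax`, the dimensions, and all weights are hypothetical and are not taken from the AIR-VLA+ implementation. The sketch shows only the blending step, in which every expert contributes with a soft weight rather than being selected by sparse top-k routing. It does not model the flow-matching decoder, the feature enhancement module, or the gradient isolation between decoders.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Affine map: one output per row of `weights`."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def dense_soft_moe(x, gate_w, gate_b, experts):
    """Blend the outputs of ALL experts with softmax gate weights.

    `experts` is a list of (weights, bias) pairs (hypothetical linear experts).
    Returns the blended feature vector and the gate weights.
    """
    gate = softmax(linear(gate_w, gate_b, x))
    outputs = [linear(w, b, x) for (w, b) in experts]
    dim = len(outputs[0])
    blended = [
        sum(g * out[i] for g, out in zip(gate, outputs))
        for i in range(dim)
    ]
    return blended, gate


# Toy usage: two experts and a gate with zero weights, so both experts
# receive equal soft weight.
x = [1.0, 0.0]
gate_w = [[0.0, 0.0], [0.0, 0.0]]
gate_b = [0.0, 0.0]
experts = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),  # identity expert
    ([[0.0, 0.0], [0.0, 0.0]], [3.0, 3.0]),  # constant expert
]
blended, gate = dense_soft_moe(x, gate_w, gate_b, experts)
print(gate)     # gate weights, one per expert
print(blended)  # equal-weight average of the two expert outputs
```

Because no expert is dropped, the output varies smoothly with the gate weights. This is one plausible reading of how soft blending could let experts lean toward different task stages without a hard switch between them.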
