CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

Flow-based vision-language-action (VLA) policies offer strong expressivity for action generation, but they suffer from a fundamental inefficiency: multi-step inference is required to recover action structure from uninformative Gaussian noise, which leads to a poor efficiency-quality trade-off under real-time constraints. We address this issue by rethinking the role of the starting point in generative action modeling. Instead of shortening the sampling trajectory, we propose CF-VLA, a coarse-to-fine two-stage formulation that restructures action generation into a coarse initialization step, which constructs an action-aware starting point, followed by a single-step local refinement that corrects residual errors. Concretely, the coarse stage learns a conditional posterior over endpoint velocity to transform Gaussian noise into a structured initialization, while the fine stage performs a fixed-time refinement from this initialization. To stabilize training, we introduce a stepwise strategy that first learns a controlled coarse predictor and then performs joint optimization. Experiments on CALVIN and LIBERO show that our method establishes a strong efficiency-performance frontier in low-NFE (number of function evaluations) regimes: it consistently outperforms existing NFE=2 methods, matches or surpasses the NFE=10 $\pi_{0.5}$ baseline on several metrics, reduces action sampling latency by 75.4%, and achieves the best average real-robot success rate of 83.0%, outperforming MIP by 19.5 percentage points and $\pi_{0.5}$ by 4.0 percentage points. These results suggest that structured, coarse-to-fine generation enables both strong performance and efficient inference. Our code is available at https://github.com/EmbodiedAI-RoboTron/CF-VLA.
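To make the two-evaluation (NFE=2) sampling pattern concrete, the following is a minimal toy sketch and not the released implementation. The functions `coarse_velocity` and `fine_velocity` are hypothetical closed-form stand-ins for the learned networks, the target `np.tanh(obs)` is an arbitrary toy action, and the fixed refinement time `t_fixed=0.9` and the linear-path velocity form are assumptions made only for illustration.

```python
import numpy as np

def toy_target(obs):
    # Hypothetical "ground-truth" action for the toy example.
    return np.tanh(obs)

def coarse_velocity(noise, obs):
    # Stand-in for the coarse network (assumption): an imperfect
    # estimate of the endpoint velocity (target - noise).
    return 0.9 * (toy_target(obs) - noise)

def fine_velocity(action, t, obs):
    # Stand-in for the fine network (assumption): on a linear path,
    # the velocity at time t points to the target, scaled by 1 / (1 - t).
    return (toy_target(obs) - action) / (1.0 - t)

def sample_action(obs, rng, t_fixed=0.9):
    # Start from uninformative Gaussian noise.
    noise = rng.standard_normal(obs.shape)
    # Evaluation 1: jump from noise to a structured, action-aware start.
    coarse = noise + coarse_velocity(noise, obs)
    # Evaluation 2: a single refinement step taken at a fixed time.
    fine = coarse + (1.0 - t_fixed) * fine_velocity(coarse, t_fixed, obs)
    return coarse, fine

rng = np.random.default_rng(0)
obs = np.array([0.2, -0.5, 1.0])
coarse, fine = sample_action(obs, rng)
print("coarse error:", np.linalg.norm(coarse - toy_target(obs)))
print("fine error:  ", np.linalg.norm(fine - toy_target(obs)))
```

In this toy setting the coarse step leaves a residual proportional to the initial noise, and the single refinement step removes it. With learned networks both steps would be approximate, so the sketch only illustrates the control flow of coarse initialization followed by one local correction.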
