We introduce flow control for vision-language-action (VLA) models, a simple and effective way to steer VLA actions in real time through generic inputs such as a keyboard. The method works out of the box and requires no retraining or fine-tuning of the VLA. It lets relatively crude user inputs steer a VLA toward the user's intent. The VLA transforms these inputs into action samples drawn from the VLA expert action distribution learned during training, so that the generated actions are both high quality (they conform to that distribution) and high fidelity (they reflect the user's intent). We demonstrate that flow control has several desirable properties: (1) it steers robot actions accurately and responsively from user inputs, (2) it is robust to suboptimal user inputs, (3) it enables users to steer VLAs to significantly higher success rates and faster task completion, and (4) fine-tuning a VLA on flow control trajectories improves the autonomous policy. Together, these results give users a simple and intuitive way to steer VLA actions and improve task performance.