We present Token Steering (TS), a method for dynamically steering trajectories generated by an autoregressive vision-language-action (VLA) model through direct intervention in the action-token space. TS injects low-dimensional user inputs into the model's native action-token representation, so users can influence trajectory generation without any change to the underlying vision-language model (VLM) architecture. Because TS operates entirely at inference time, it requires no additional training or finetuning. User inputs guide rather than override the pretrained policy, which preserves the dexterity, smoothness, and task priors learned by the VLA. We evaluate TS on two household manipulation tasks, drawer closing after object placement and state-aware object swapping, where it improves success rates from 10.0% to 72.5% and from 16.7% to 93.8%, respectively. By enabling lightweight, intuitive steering of robot foundation models, our interface has the potential to improve human-robot interaction in consumer environments and broaden accessibility for individuals with limited physical control. Project website: https://jasontchan.github.io/token-steering/
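The abstract does not specify how user inputs are injected into the action-token space, so the following is only an illustrative sketch of one way inference-time steering could work, not the authors' implementation. It assumes a VLA that discretizes each action dimension into uniform bins over [-1, 1] and emits one logit per bin. The function names (`bin_centers`, `steer_logits`, `greedy_action`), the quadratic bias, and the `strength` parameter are all hypothetical.

```python
def bin_centers(n_bins, low=-1.0, high=1.0):
    """Centers of n_bins uniform bins covering [low, high] (assumed discretization)."""
    width = (high - low) / n_bins
    return [low + (i + 0.5) * width for i in range(n_bins)]


def steer_logits(logits, user_value, strength, low=-1.0, high=1.0):
    """Bias action-token logits toward the bin nearest user_value.

    Hypothetical rule: subtract strength * (bin_center - user_value)^2
    from each logit. strength = 0 leaves the policy's logits unchanged;
    larger values give the user more influence over the decoded token.
    """
    centers = bin_centers(len(logits), low, high)
    return [l - strength * (c - user_value) ** 2 for l, c in zip(logits, centers)]


def greedy_action(logits, low=-1.0, high=1.0):
    """Decode the continuous action as the center of the highest-logit bin."""
    centers = bin_centers(len(logits), low, high)
    best = max(range(len(logits)), key=lambda i: logits[i])
    return centers[best]


# Toy logits for one action dimension with 8 bins; the policy prefers bin 1.
policy_logits = [0.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

unsteered = greedy_action(steer_logits(policy_logits, user_value=0.625, strength=0.0))
steered = greedy_action(steer_logits(policy_logits, user_value=0.625, strength=10.0))
```

In this sketch the model weights are never touched: the bias is applied to the logits at each decoding step, which matches the training-free property described above. Intermediate values of `strength` trade off the policy's preference against the user's input rather than replacing it.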