In this paper, we introduce a novel kinematics-rich vision-language-action (VLA) task in which language commands densely encode diverse kinematic attributes (such as direction, trajectory, orientation, and relative displacement) at key moments from initiation through completion. Unlike existing action instructions, which capture kinematics only coarsely or partially, such commands support fine-grained and personalized manipulation. In this setting, task goals remain invariant while execution trajectories must adapt to instruction-level kinematic specifications. To address this challenge, we propose KineVLA, a vision-language-action framework that explicitly decouples goal-level invariance from kinematics-level variability through a bi-level action representation and bi-level reasoning tokens, which serve as explicit, supervised intermediate variables that align language and action. To support this task, we construct kinematics-aware VLA datasets spanning both simulation and real-world robotic platforms, featuring instruction-level kinematic variations and bi-level annotations. Extensive experiments on LIBERO and a Realman-75 robot demonstrate that KineVLA consistently outperforms strong VLA baselines on kinematics-sensitive benchmarks, achieving more precise, controllable, and generalizable manipulation behaviors.
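To make the bi-level decoupling concrete, the sketch below illustrates the idea of separating goal-level content from kinematics-level content in an instruction. This is a minimal, hypothetical illustration, not the KineVLA implementation: all names (BiLevelTarget, parse_instruction, the keyword vocabulary) are assumptions, and the keyword heuristic stands in for the model's predicted reasoning tokens.

```python
# Hypothetical sketch of the bi-level decoupling described above; not the
# authors' code. In KineVLA, a vision-language model would predict these
# reasoning tokens; a keyword heuristic stands in for that prediction here.
from dataclasses import dataclass
from typing import List


@dataclass
class BiLevelTarget:
    """Supervision target separating what to do from how to move."""
    goal_tokens: List[str]       # goal-level: invariant across instructions
    kinematic_tokens: List[str]  # kinematics-level: varies per command


def parse_instruction(command: str) -> BiLevelTarget:
    """Toy split of a command into goal and kinematic descriptors."""
    kinematic_vocab = {"slowly", "clockwise", "upward", "leftward", "arc"}
    words = command.lower().replace(",", "").split()
    kin = [w for w in words if w in kinematic_vocab]
    goal = [w for w in words if w not in kinematic_vocab]
    return BiLevelTarget(goal_tokens=goal, kinematic_tokens=kin)


if __name__ == "__main__":
    t = parse_instruction("Lift the mug slowly, moving upward in an arc")
    print("goal:", t.goal_tokens)             # invariant task goal
    print("kinematics:", t.kinematic_tokens)  # instruction-level variation
```

Under this framing, two commands with the same goal_tokens but different kinematic_tokens would demand the same task outcome via different trajectories, which is the variability the bi-level representation is meant to capture.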