NoTVLA: Semantics-Preserving Robot Adaptation via Narrative Action Interfaces

Vision-Language-Action (VLA) models represent a pivotal advance in embodied intelligence, yet they face critical barriers to real-world deployment, most notably catastrophic forgetting. This issue stems from their overreliance on continuous action sequences or action chunks, which inadvertently create isolated data silos that disrupt knowledge retention across tasks. To address these challenges, we propose the Narrowing of Trajectory VLA (NoTVLA) framework, which narrows its focus to sparse trajectories and thereby avoids the catastrophic forgetting associated with dense trajectory fine-tuning. A key innovation of NoTVLA lies in its trajectory planning strategy: instead of centering on the target object's trajectory, it applies temporal compression and spatial reasoning pruning specifically to the robot end effector's trajectory. Furthermore, training uses these sparse trajectories rather than dense action trajectories, which yields practical advantages, including better zero-shot performance. In multi-task evaluation scenarios, NoTVLA achieves superior performance and generalization compared to pi0 while operating under two critical constraints: it uses over an order of magnitude less computing power than pi0 and requires no wrist-mounted camera. Under this design, NoTVLA's operational accuracy closely approximates that of single-task expert models. Crucially, it also preserves the model's inherent language capabilities, enabling zero-shot generalization in specific scenarios, supporting unified model deployment across multiple robot platforms, and retaining a degree of generalization even when perceiving tasks from novel perspectives.
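
To make the idea of temporal compression concrete, the sketch below reduces a dense end-effector position trajectory to a handful of waypoints. The abstract does not specify the compression procedure, so this is only an illustration under stated assumptions: it substitutes the classic Ramer-Douglas-Peucker line simplification, and the function names `point_segment_distance` and `sparsify_trajectory` as well as the `tolerance` parameter are hypothetical, not part of NoTVLA.

```python
import numpy as np


def point_segment_distance(points, start, end):
    """Euclidean distance from each row of `points` to the segment start-end."""
    seg = end - start
    seg_len_sq = float(np.dot(seg, seg))
    if seg_len_sq == 0.0:
        # Degenerate segment: fall back to distance to a single point.
        return np.linalg.norm(points - start, axis=1)
    # Projection parameter of each point onto the segment, clamped to [0, 1].
    t = np.clip((points - start) @ seg / seg_len_sq, 0.0, 1.0)
    proj = start + t[:, None] * seg
    return np.linalg.norm(points - proj, axis=1)


def sparsify_trajectory(positions, tolerance):
    """Return sorted indices of waypoints kept by Ramer-Douglas-Peucker.

    ASSUMPTION: RDP stands in for the unspecified compression step;
    `tolerance` is the maximum allowed deviation, in the same units
    as `positions`.
    """
    positions = np.asarray(positions, dtype=float)
    n = len(positions)
    if n <= 2:
        return list(range(n))
    keep = {0, n - 1}          # endpoints are always retained
    stack = [(0, n - 1)]       # iterative form avoids recursion limits
    while stack:
        lo, hi = stack.pop()
        if hi - lo < 2:
            continue
        d = point_segment_distance(positions[lo + 1:hi], positions[lo], positions[hi])
        k = int(np.argmax(d))
        if d[k] > tolerance:
            mid = lo + 1 + k   # farthest point becomes a new waypoint
            keep.add(mid)
            stack.append((lo, mid))
            stack.append((mid, hi))
    return sorted(keep)


# Toy L-shaped motion: move 1 m along x, then 1 m along z (99 dense samples).
s = np.linspace(0.0, 1.0, 50)
leg_x = np.stack([s, np.zeros(50), np.zeros(50)], axis=1)
leg_z = np.stack([np.ones(50), np.zeros(50), s], axis=1)
dense = np.concatenate([leg_x, leg_z[1:]])

kept = sparsify_trajectory(dense, tolerance=0.01)
print(kept)  # [0, 49, 98]: start, corner, and end
```

In this toy case 99 dense samples collapse to three waypoints. A real pipeline would also need to handle orientation and gripper state, which this sketch leaves out.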
