Discrete Vision-Language-Action (VLA) models typically formulate action generation as next-token prediction over discretized action spaces, conditioning each token autoregressively on prior context. While effective, this paradigm incurs high inference latency and largely ignores the temporal structure inherent in action trajectories. Recent efforts introduce parallel decoding to speed up inference, but lack explicit mechanisms for modeling dependencies between tokens. We introduce TBD-VLA, a discrete token-based VLA framework that incorporates block diffusion to enable temporal action generation. We partition action sequences into temporal blocks and perform masked discrete diffusion within each block, while maintaining autoregressive generation across blocks. This design unifies temporal autoregression and parallel action decoding, achieving both strong temporal coherence and improved inference speed. In addition, the explicit temporal modeling enables asynchronous execution of action chunks (e.g., Real-Time Chunking) via temporal in-painting. TBD-VLA significantly outperforms prior VLA approaches in both simulation and real-world manipulation tasks, offering a scalable path toward fast, temporally aware, discrete VLA models. Project webpage: https://tbd-vla.github.io/
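As a rough illustration of the decoding pattern described above (autoregressive across temporal blocks, masked discrete diffusion within each block), the following minimal sketch may help. It is not the TBD-VLA implementation: `toy_denoiser` is a hypothetical stand-in that returns random logits instead of a trained vision-language backbone, and the confidence-ranked unmasking schedule, the `MASK` sentinel, and every function name and parameter are assumptions chosen for illustration only.

```python
import math
import random

MASK = -1  # hypothetical sentinel id for a still-masked action token


def toy_denoiser(context, block, vocab_size, rng):
    """Stand-in for the model: random logits for every position in the block.

    A real model would condition on images, language, `context` (earlier
    blocks), and the partially unmasked `block`; this stub ignores them all.
    """
    return [[rng.gauss(0.0, 1.0) for _ in range(vocab_size)] for _ in block]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_block(context, block_len, vocab_size, num_steps, rng):
    """Fill one block in parallel by iterative unmasking (assumed schedule)."""
    block = [MASK] * block_len
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        if not masked:
            break
        logits = toy_denoiser(context, block, vocab_size, rng)
        probs = [softmax(row) for row in logits]
        confidence = {i: max(probs[i]) for i in masked}
        # Commit an even share of the remaining masked positions per step,
        # so the final step always fills whatever is left.
        k = math.ceil(len(masked) / (num_steps - step))
        for i in sorted(masked, key=lambda j: -confidence[j])[:k]:
            block[i] = probs[i].index(confidence[i])  # argmax token id
    return block


def generate_actions(num_blocks, block_len, vocab_size=256, num_steps=4, seed=0):
    """Generate blocks left to right; each block sees all earlier blocks."""
    rng = random.Random(seed)
    context = []
    for _ in range(num_blocks):
        context.extend(decode_block(context, block_len, vocab_size, num_steps, rng))
    return context


if __name__ == "__main__":
    print(generate_actions(num_blocks=3, block_len=8))
```

In this sketch each block of 8 tokens takes at most `num_steps` denoiser calls rather than 8 sequential ones, which is where the latency saving over token-by-token decoding would come from. Pre-filling some positions of a block with already committed tokens before unmasking would be one way to express the temporal in-painting idea, though that variant is not shown here.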