We present the B-spline Encoded Action Sequence Tokenizer (BEAST), a novel action tokenizer that encodes action sequences into compact discrete or continuous tokens using B-splines. In contrast to existing action tokenizers based on vector quantization or byte pair encoding, BEAST requires no separate tokenizer training and consistently produces tokens of uniform length, enabling fast action sequence generation via parallel decoding. Leveraging our B-spline formulation, BEAST inherently ensures smooth trajectories without discontinuities between adjacent segments. We extensively evaluate BEAST by integrating it with three distinct model architectures: a Variational Autoencoder (VAE) with continuous tokens, a decoder-only Transformer with discrete tokens, and Florence-2, a pretrained Vision-Language Model with an encoder-decoder architecture, demonstrating BEAST's compatibility and scalability with large pretrained models. We evaluate BEAST on three established benchmarks comprising 166 simulated tasks and on three distinct robot setups with a total of 8 real-world tasks. Experimental results demonstrate that BEAST (i) significantly reduces both training and inference computational costs, (ii) consistently generates smooth, high-frequency control signals suitable for continuous control tasks, and (iii) reliably achieves task success rates competitive with state-of-the-art methods.
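The core encoding idea can be sketched in plain NumPy: fit B-spline control points to an action chunk by least squares and treat the control points as a fixed-length token sequence, then decode by evaluating the spline. This is an illustrative reconstruction of the general technique under stated assumptions, not BEAST's actual implementation; the function names and the choice of 8 cubic control points are ours.

```python
import numpy as np

def cox_de_boor(x, i, k, knots):
    """Value of the i-th B-spline basis function of degree k at x."""
    if k == 0:
        if knots[i] <= x < knots[i + 1]:
            return 1.0
        # close the final half-open interval so x = knots[-1] is covered
        if x == knots[-1] and knots[i] < knots[i + 1] == knots[-1]:
            return 1.0
        return 0.0
    out = 0.0
    d = knots[i + k] - knots[i]
    if d > 0.0:
        out += (x - knots[i]) / d * cox_de_boor(x, i, k - 1, knots)
    d = knots[i + k + 1] - knots[i + 1]
    if d > 0.0:
        out += (knots[i + k + 1] - x) / d * cox_de_boor(x, i + 1, k - 1, knots)
    return out

def clamped_knots(n_ctrl, degree):
    """Clamped uniform knot vector on [0, 1] (spline interpolates its endpoints)."""
    interior = np.linspace(0.0, 1.0, n_ctrl - degree + 1)[1:-1]
    return np.concatenate([np.zeros(degree + 1), interior, np.ones(degree + 1)])

def basis_matrix(T, n_ctrl, degree):
    """(T, n_ctrl) matrix of basis values at T evenly spaced parameters."""
    knots = clamped_knots(n_ctrl, degree)
    ts = np.linspace(0.0, 1.0, T)
    return np.array([[cox_de_boor(t, i, degree, knots) for i in range(n_ctrl)]
                     for t in ts])

def encode(actions, n_ctrl=8, degree=3):
    """Least-squares fit: (T, D) action chunk -> (n_ctrl, D) control-point tokens."""
    B = basis_matrix(len(actions), n_ctrl, degree)
    ctrl, *_ = np.linalg.lstsq(B, actions, rcond=None)
    return ctrl

def decode(ctrl, T, degree=3):
    """Evaluate the spline at T evenly spaced timesteps -> (T, D) actions."""
    B = basis_matrix(T, len(ctrl), degree)
    return B @ ctrl

# A smooth 50-step, 2-DoF chunk compresses to 8 tokens and decodes back closely.
T, D = 50, 2
t = np.linspace(0.0, 2 * np.pi, T)
actions = np.stack([np.sin(t), np.cos(t)], axis=1)
tokens = encode(actions)        # uniform length regardless of T
recon = decode(tokens, T)
```

Because every chunk yields exactly `n_ctrl` tokens, token length is uniform by construction, and the smoothness of the decoded trajectory follows from the C²-continuity of cubic B-splines rather than from any property of the downstream model.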