In-context imitation learning (ICIL) is a new paradigm that enables robots to generalize from demonstrations to unseen tasks without retraining. A well-structured action representation is key to capturing demonstration information effectively, yet action tokenization (the process of discretizing and encoding actions) remains largely unexplored in ICIL. In this work, we first systematically evaluate existing action tokenization methods in ICIL and reveal a critical limitation: while they effectively encode action trajectories, they fail to preserve temporal smoothness, which is crucial for stable robotic execution. To address this, we propose LipVQ-VAE, a variational autoencoder that enforces the Lipschitz condition in the latent action space via weight normalization. By propagating smoothness constraints from the raw action inputs to the quantized latent codebook, LipVQ-VAE generates smoother and more stable actions. When integrated into ICIL, LipVQ-VAE improves performance by more than 5.3% in high-fidelity simulators, and real-world experiments confirm its ability to produce smoother, more reliable trajectories. Code and checkpoints are available at https://action-tokenizer-matters.github.io/
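To make the core mechanism concrete, here is a minimal NumPy sketch of the general idea behind enforcing a Lipschitz bound via weight normalization and then quantizing against a codebook. This is an illustrative assumption-laden toy, not the paper's implementation: the function names (`lipschitz_linear`, `quantize`), the choice of the infinity norm, and the `softplus` bound are all hypothetical simplifications.

```python
import numpy as np

def softplus(x):
    """Smooth positive reparameterization of the per-layer Lipschitz bound."""
    return np.log1p(np.exp(x))

def lipschitz_linear(x, W, b, c):
    """Linear layer whose Lipschitz constant (w.r.t. the inf-norm) is at most softplus(c).

    Each row of W is rescaled so its absolute row sum does not exceed softplus(c),
    which bounds the induced inf-norm of the effective weight matrix.
    """
    row_sums = np.abs(W).sum(axis=1, keepdims=True)
    scale = np.minimum(1.0, softplus(c) / row_sums)  # only shrink, never amplify
    return x @ (W * scale).T + b

def quantize(z, codebook):
    """Standard VQ step: snap each latent to its nearest codebook entry."""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

# Two nearby action inputs map to nearby latents: the layer's output change
# is bounded by softplus(c) times the input change, so smoothness in the raw
# actions propagates toward the quantized latent space.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
b = np.zeros(4)
c = 0.0  # Lipschitz bound softplus(0) ≈ 0.693
x1 = rng.normal(size=(1, 3))
x2 = x1 + 0.01
y1 = lipschitz_linear(x1, W, b, c)
y2 = lipschitz_linear(x2, W, b, c)
codebook = rng.normal(size=(8, 4))
z_q, idx = quantize(y1, codebook)
```

In this toy setting, `np.abs(y1 - y2).max()` is guaranteed not to exceed `softplus(c) * np.abs(x1 - x2).max()`, which is the sense in which the latent map is Lipschitz; a trained model would stack such layers and learn `c` jointly with the codebook.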