We present a novel hierarchical spatiotemporal action tokenizer for in-context imitation learning. We first propose a hierarchical approach consisting of two successive levels of vector quantization: the lower level assigns input actions to fine-grained subclusters, while the higher level further maps these subclusters to coarse clusters. Trained to reconstruct input actions, this hierarchical tokenizer mainly exploits spatial information, yet already outperforms its non-hierarchical counterpart. We then extend the approach to exploit both spatial and temporal cues, forming a hierarchical spatiotemporal action tokenizer, HiST-AT, which performs multi-level clustering while simultaneously recovering input actions and their associated timestamps. Finally, extensive evaluations on multiple simulated and real robotic manipulation benchmarks show that our approach establishes new state-of-the-art performance in in-context imitation learning.
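The two-level tokenization described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the codebook sizes, action dimension, and random initialization are all hypothetical, and in practice both codebooks would be learned jointly with a reconstruction objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
ACTION_DIM = 7        # e.g. a 7-DoF manipulator action
NUM_SUBCLUSTERS = 64  # fine-grained lower-level codebook
NUM_CLUSTERS = 8      # coarse higher-level codebook

# Two codebooks; randomly initialized here, learned in practice.
sub_codebook = rng.normal(size=(NUM_SUBCLUSTERS, ACTION_DIM))
cluster_codebook = rng.normal(size=(NUM_CLUSTERS, ACTION_DIM))

def quantize(x, codebook):
    """Return the index of the nearest codebook entry (Euclidean distance)."""
    dists = np.linalg.norm(codebook - x, axis=1)
    return int(np.argmin(dists))

def tokenize(action):
    """Two successive levels of vector quantization:
    action -> fine-grained subcluster -> coarse cluster."""
    sub_id = quantize(action, sub_codebook)                        # lower level
    cluster_id = quantize(sub_codebook[sub_id], cluster_codebook)  # higher level
    return sub_id, cluster_id

action = rng.normal(size=ACTION_DIM)
sub_id, cluster_id = tokenize(action)
```

Each input action is thus represented by a pair of discrete tokens, one per level of the hierarchy.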