Generalization capability, or rather the lack thereof, is one of the most important unsolved problems in robot learning, and although several large-scale efforts have set out to tackle it, unsolved it remains. In this paper, we hypothesize that learning temporal action abstractions with latent variable models (LVMs), which learn to map data to a compressed latent space and back, is a promising direction towards low-level skills that can readily be reused for new tasks. Although several works have attempted to show this, they have generally been limited by architectures that do not faithfully capture shareable representations. To address this, we present Quantized Skill Transformer (QueST), which learns a larger and more flexible latent encoding that better models the breadth of low-level skills needed across a variety of tasks. To make use of this extra flexibility, QueST imparts causal inductive bias from the action-sequence data into the latent space, yielding more semantically useful and transferable representations. We compare against state-of-the-art imitation learning and LVM baselines and find that QueST's architecture leads to strong performance on several multitask and few-shot learning benchmarks. Further results and videos are available at https://quest-model.github.io/
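To make the "quantized" latent encoding concrete, the core operation in VQ-style latent variable models is snapping each continuous latent vector produced by an encoder to its nearest entry in a learned codebook, so an action sequence becomes a short sequence of discrete skill tokens. The sketch below is illustrative only, assuming a toy codebook and random latents; the codebook size, dimensions, and all names are hypothetical and not taken from the QueST architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned codebook: 16 discrete skill codes, each a 4-dim vector.
codebook = rng.normal(size=(16, 4))

def quantize(z):
    """Map each continuous latent vector to its nearest codebook entry.

    Returns the quantized vectors and their discrete token indices.
    """
    # Squared Euclidean distance between every latent and every code:
    # shape (num_latents, num_codes).
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = d.argmin(axis=1)          # discrete index per latent
    return codebook[tokens], tokens    # quantized vectors, token ids

# Stand-in for an encoder's output over an action chunk: 8 latent vectors.
z = rng.normal(size=(8, 4))
z_q, tokens = quantize(z)
```

A downstream decoder would map `z_q` back to low-level actions, while a prior (here, a transformer) can model the discrete `tokens` directly.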