Generalization, or rather the lack thereof, is one of the most important unsolved problems in robot learning; several large-scale efforts have set out to tackle it, yet unsolved it remains. In this paper, we hypothesize that learning temporal action abstractions with latent variable models (LVMs), which learn to map data to a compressed latent space and back, is a promising route to low-level skills that can readily be reused for new tasks. Although several works have attempted to show this, they have generally been limited by architectures that do not faithfully capture shareable representations. To address this, we present Quantized Skill Transformer (QueST), which learns a larger, more flexible latent encoding that can better model the breadth of low-level skills needed across a variety of tasks. To make use of this extra flexibility, QueST imparts causal inductive bias from the action-sequence data into the latent space, yielding more semantically useful and transferable representations. We compare against state-of-the-art imitation learning and LVM baselines and find that QueST's architecture leads to strong performance on several multitask and few-shot learning benchmarks. Further results and videos are available at https://quest-model.github.io/
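To make the core idea concrete, below is a minimal PyTorch sketch of the kind of model the abstract describes: an autoencoder that maps short action sequences to a discrete (quantized) latent code, with causal layers so each latent token depends only on past actions. This is an illustrative assumption, not the authors' implementation; the class names, the simple nearest-neighbor codebook, and all hyperparameters (action_dim, codebook_size, the downsampling factor) are hypothetical.

```python
# Minimal sketch (NOT the QueST implementation) of a quantized, causal
# action-sequence autoencoder. All names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Conv1d):
    """1D convolution that left-pads the input so no output sees future timesteps."""
    def forward(self, x):
        pad = (self.kernel_size[0] - 1) * self.dilation[0]
        return super().forward(F.pad(x, (pad, 0)))

class QuantizedSkillAutoencoder(nn.Module):
    def __init__(self, action_dim=7, latent_dim=64, codebook_size=512, downsample=4):
        super().__init__()
        # Causal encoder: compresses the action sequence in time by `downsample`.
        self.encoder = nn.Sequential(
            CausalConv1d(action_dim, latent_dim, kernel_size=5),
            nn.GELU(),
            CausalConv1d(latent_dim, latent_dim, kernel_size=5, stride=downsample),
        )
        # Discrete codebook of "skill" tokens.
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        # Decoder: upsamples skill codes back to a full action sequence.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(latent_dim, latent_dim,
                               kernel_size=downsample, stride=downsample),
            nn.GELU(),
            CausalConv1d(latent_dim, action_dim, kernel_size=5),
        )

    def quantize(self, z):
        # Nearest-neighbor codebook lookup; the straight-through estimator
        # lets gradients flow through the discrete choice. Depending on the
        # quantizer, a commitment/codebook loss term may also be needed.
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (B, T', K)
        idx = dists.argmin(dim=-1)                                       # skill token ids
        z_q = self.codebook(idx)
        return z + (z_q - z).detach(), idx

    def forward(self, actions):  # actions: (B, T, action_dim)
        z = self.encoder(actions.transpose(1, 2)).transpose(1, 2)
        z_q, idx = self.quantize(z)
        recon = self.decoder(z_q.transpose(1, 2)).transpose(1, 2)
        return recon, idx

model = QuantizedSkillAutoencoder()
actions = torch.randn(8, 32, 7)       # a batch of 32-step, 7-DoF action chunks
recon, skill_ids = model(actions)
loss = F.mse_loss(recon, actions)     # reconstruction objective
```

The causal convolutions are one simple way to realize the causal inductive bias the abstract refers to: each discrete skill token summarizes only the actions up to its position in the sequence, rather than the whole chunk at once.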