Recent advances in ML suggest that the quantity of data available to a model is one of the primary bottlenecks to high performance. Although almost unlimited amounts of reasonably coherent data exist for language-based tasks, this is generally not the case for Reinforcement Learning, especially when dealing with a novel environment. Indeed, even a relatively trivial continuous environment has an almost limitless number of states, but simply sampling random states and actions is unlikely to provide transitions that are interesting or useful for any potential downstream task. How should one generate massive amounts of useful data given only an MDP with no indication of downstream tasks? Are the quantity and quality of data truly transformative to the performance of a general controller? We address both questions. First, we introduce a principled unsupervised exploration method, ChronoGEM, which aims to achieve uniform coverage over the manifold of achievable states; we believe this is the most reasonable goal given no prior task information. Second, we investigate the effects of both data quantity and data quality on the training of a downstream goal-achievement policy, and show that both a large quantity and a high quality of data are essential to train a general controller: a high-precision pose-achievement policy capable of attaining a large number of poses over numerous continuous control embodiments, including a humanoid.