This work studies a Reinforcement Learning (RL) problem in which we are given a set of trajectories collected with K baseline policies. Each of these policies can be quite suboptimal in isolation, yet exhibit strong performance in complementary parts of the state space. The goal is to learn a policy that performs as well as the best combination of the baselines over the entire state space. We propose a simple imitation-learning-based algorithm, establish a sample-complexity bound on its accuracy, and prove that the algorithm is minimax optimal by exhibiting a matching lower bound. Further, we apply the algorithm to machine-learning-guided compiler optimization, learning policies for inlining programs with the objective of producing small binaries. We demonstrate that, through a few iterations of our approach, we can learn a policy that outperforms an initial policy trained via standard RL.
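To make the high-level description concrete, the following is a minimal sketch of one plausible instantiation of such an imitation-learning step, not the paper's actual algorithm: for every state visited by several baselines, keep the action of the baseline with the highest empirical return-to-go from that state, then behavior-clone the resulting dataset. It assumes hashable states, discrete actions, and Monte Carlo returns as the per-state score; all function names (`aggregate_demonstrations`, `imitate`) are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data layout (not from the paper): for each of the K
# baselines, a list of trajectories; each trajectory is a list of
# (state, action, reward) triples, with states as hashable tuples.

def returns_to_go(rewards, gamma=1.0):
    """Monte Carlo return-to-go at every step of one trajectory."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

def aggregate_demonstrations(trajs_per_baseline, gamma=1.0):
    """Keep, per state, the (return, action) pair of whichever baseline
    achieved the highest empirical return-to-go there, so the imitator
    copies the locally strongest baseline."""
    best = {}  # state -> (return-to-go, action)
    for trajs in trajs_per_baseline:
        for traj in trajs:
            states, actions, rewards = zip(*traj)
            for s, a, g in zip(states, actions,
                               returns_to_go(rewards, gamma)):
                if s not in best or g > best[s][0]:
                    best[s] = (g, a)
    X = np.array([np.asarray(s) for s in best])
    y = np.array([a for _, a in best.values()])
    return X, y

def imitate(trajs_per_baseline, gamma=1.0):
    """Behavior-clone the aggregated (state, action) dataset."""
    X, y = aggregate_demonstrations(trajs_per_baseline, gamma)
    return LogisticRegression(max_iter=1000).fit(X, y)
```

Any classifier could stand in for the logistic-regression imitator here; the essential point is only that the combined demonstration set inherits each baseline's behavior on the region of the state space where that baseline's observed returns are best.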