Modern reinforcement learning systems produce many high-quality policies throughout the learning process. However, choosing which policy to deploy in the real world requires testing the candidates under an intractable number of environmental conditions. We introduce RPOSST, an algorithm that selects a small set of test cases from a larger pool based on a relatively small number of sample evaluations. RPOSST treats the test case selection problem as a two-player game and optimizes a solution with provable $k$-of-$N$ robustness, bounding the error relative to a test that uses all the test cases in the pool. Empirical results demonstrate that RPOSST finds small sets of test cases that identify high-quality policies in a toy one-shot game, poker datasets, and a high-fidelity racing simulator.
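To make the $k$-of-$N$ idea concrete, the following is a minimal sketch and not the RPOSST algorithm itself. It assumes a $k$-of-$N$ objective of the usual form (sample $N$ scenarios, then average the $k$ worst losses) and defines the loss as the gap between a policy's average score on a small test and its average over the full pool. The score matrix, the use of bootstrap resamples of policies as "scenarios", the brute-force search, and all function names are illustrative assumptions; the two-player game formulation from the abstract is not reproduced here.

```python
import random
from itertools import combinations


def subset_error(scores, subset):
    """Mean absolute gap, over policies, between the average score on
    `subset` and the average score on the full pool of test cases.
    `scores[i][j]` is the (assumed) score of policy i on test case j."""
    n_cases = len(scores[0])
    total = 0.0
    for row in scores:
        full = sum(row) / n_cases
        small = sum(row[j] for j in subset) / len(subset)
        total += abs(full - small)
    return total / len(scores)


def k_of_n_error(scores, subset, k, n, rng):
    """Average of the k largest errors among n sampled scenarios.
    Assumption: a scenario is a bootstrap resample of the policies."""
    losses = []
    for _ in range(n):
        sample = [rng.choice(scores) for _ in scores]
        losses.append(subset_error(sample, subset))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k


def select_tests(scores, size, k, n, seed=0):
    """Brute-force search for the subset of `size` test cases with the
    smallest k-of-N error. Each subset is scored with the same seed so
    that all candidates see the same sampled scenarios."""
    n_cases = len(scores[0])
    return min(
        combinations(range(n_cases), size),
        key=lambda s: k_of_n_error(scores, s, k, n, random.Random(seed)),
    )


if __name__ == "__main__":
    # Hypothetical scores: 2 policies evaluated on 3 test cases.
    scores = [[2.0, 0.0, 4.0], [5.0, 1.0, 9.0]]
    print(select_tests(scores, size=1, k=2, n=5))
```

Setting `k = n` recovers the average error over sampled scenarios, while `k = 1` considers only the single worst sampled scenario, so `k` controls how pessimistic the selection is. Brute-force enumeration is only feasible for tiny pools and is used here purely for illustration.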