Reinforcement learning is a leading technique for sequential decision problems, including complex tasks such as driving cars and landing spacecraft. Among software validation and verification practices, testing for functional fault detection is a convenient way to build trust in the learned decision model. While recent work seeks to maximise the number of detected faults, none of it characterises faults during the search in order to promote diversity. We argue that policy testing should not aim to find as many failures as possible (e.g., inputs that trigger similar car crashes) but rather to reveal faults in the model that are as informative and diverse as possible. In this paper, we explore the use of quality diversity optimisation to address fault diversity in policy testing. Quality diversity (QD) optimisation is a family of evolutionary algorithms for hard combinatorial optimisation problems in which a set of diverse, high-quality solutions is sought. We define and address the challenges of adapting QD optimisation to the testing of action policies. Furthermore, we compare classical QD optimisers with state-of-the-art frameworks dedicated to policy testing, in terms of both search efficiency and fault diversity. We show that QD optimisation, while conceptually simple and generally applicable, effectively finds more diverse faults in the decision model, and we conclude that QD-based policy testing is a promising approach.
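To make the idea of QD optimisation concrete, the following is a minimal sketch of a MAP-Elites-style loop, one classical QD optimiser. It is an illustration under stated assumptions, not the setup used in the paper: the `simulate` function is a hypothetical stand-in for executing the policy under test, and the quality score, behaviour descriptor, grid resolution, and mutation scale are all invented for the example.

```python
import random


def simulate(test_input):
    """Hypothetical stand-in for running the policy under test.

    Returns (quality, descriptor). Quality is higher when the run is
    closer to a failure; descriptor is a 2-D behaviour summary in [0, 1]^2.
    Both formulas are toy assumptions, not taken from the paper.
    """
    x, y = test_input
    quality = 1.0 - abs(x - y)        # toy "closeness to failure" score
    descriptor = (x, (x + y) / 2.0)   # toy behaviour characterisation
    return quality, descriptor


def cell_of(descriptor, bins):
    """Map a descriptor in [0, 1]^2 to a cell of a bins-by-bins grid."""
    return tuple(min(int(d * bins), bins - 1) for d in descriptor)


def map_elites(iterations=2000, bins=5, init=50, sigma=0.1, seed=0):
    """Keep the best test input found so far in each behaviour cell."""
    rng = random.Random(seed)
    archive = {}  # cell -> (quality, test_input)
    for i in range(iterations):
        if i < init or not archive:
            # Bootstrap phase: sample test inputs uniformly at random.
            candidate = (rng.random(), rng.random())
        else:
            # Select a stored elite and apply a clipped Gaussian mutation.
            _, parent = rng.choice(list(archive.values()))
            candidate = tuple(
                min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in parent
            )
        quality, descriptor = simulate(candidate)
        cell = cell_of(descriptor, bins)
        # Insert if the cell is empty or the candidate beats the incumbent.
        if cell not in archive or quality > archive[cell][0]:
            archive[cell] = (quality, candidate)
    return archive


if __name__ == "__main__":
    elites = map_elites()
    print(f"filled cells: {len(elites)}")
```

The archive keeps one elite per behaviour cell, so the search is rewarded for covering distinct behaviours rather than for repeatedly rediscovering the same kind of failure, which is the property the abstract argues policy testing should have.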