Recent years have witnessed remarkable achievements by decision-making policies across various fields, such as autonomous driving and robotics. Testing these policies is crucial, since critical scenarios exist that may threaten their reliability, and numerous research efforts have been dedicated to such testing. However, significant challenges remain, such as low testing efficiency and limited scenario diversity, owing to the complexity of the policies and environments under test. Inspired by the capabilities of large language models (LLMs), in this paper we propose an LLM-driven online testing framework for efficiently testing decision-making policies. The main idea is to employ an LLM-based test scenario generator that intelligently produces challenging test cases through deliberate reasoning. Specifically, we first design a "generate-test-feedback" pipeline and apply templated prompt engineering to fully leverage the knowledge and reasoning abilities of LLMs. We then introduce a multi-scale scenario generation strategy to address the inherent difficulty LLMs have with making fine-grained adjustments, further enhancing testing efficiency. Finally, we evaluate the approach on five widely used benchmarks. The experimental results demonstrate that our method significantly outperforms baseline approaches in uncovering both critical and diverse scenarios.
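To make the workflow above concrete, below is a minimal sketch of the "generate-test-feedback" loop with a coarse-to-fine (multi-scale) schedule. All names here (`query_llm`, `run_policy_in_scenario`, `PROMPT_TEMPLATE`, the two-parameter scenario schema) are illustrative assumptions rather than the paper's actual API; the LLM call and the simulator are stubbed so the sketch runs as-is.

```python
# Illustrative sketch of a "generate-test-feedback" loop with a multi-scale
# schedule. query_llm and run_policy_in_scenario are hypothetical stand-ins,
# not the paper's implementation.
import json
import random

# Templated prompt carrying the current scale, last scenario, and feedback.
PROMPT_TEMPLATE = (
    "You are a test scenario generator for a driving policy.\n"
    "Current adjustment scale: {scale}. Previous scenario: {scenario}.\n"
    "Feedback from the last test: {feedback}.\n"
    "Propose a new scenario as JSON with keys 'ego_speed' and 'npc_gap', "
    "adjusting parameters at the {scale} scale to make the policy fail."
)

def query_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; here it just samples parameters."""
    return json.dumps({"ego_speed": random.uniform(5, 30),
                       "npc_gap": random.uniform(2, 40)})

def run_policy_in_scenario(scenario: dict) -> dict:
    """Stand-in for executing the policy under test in a simulator."""
    # Toy oracle: a small gap relative to speed counts as a critical scenario.
    critical = scenario["npc_gap"] < 0.5 * scenario["ego_speed"]
    return {"critical": critical, "min_distance": scenario["npc_gap"]}

def test_loop(budget: int = 20) -> list[dict]:
    found, scenario, feedback = [], {}, "none yet"
    for step in range(budget):
        # Multi-scale schedule: coarse exploration first, fine refinement later.
        scale = "coarse" if step < budget // 2 else "fine"
        prompt = PROMPT_TEMPLATE.format(scale=scale, scenario=scenario,
                                        feedback=feedback)
        scenario = json.loads(query_llm(prompt))           # generate
        result = run_policy_in_scenario(scenario)          # test
        feedback = f"min_distance={result['min_distance']:.1f}"  # feedback
        if result["critical"]:
            found.append(scenario)
    return found

if __name__ == "__main__":
    print(f"critical scenarios found: {len(test_loop())}")
```

In the actual framework, `query_llm` would issue the templated prompt to a real LLM, and the feedback string would summarize simulation outcomes (for example, near-miss distances or failure indicators) to steer the next generation step.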