We propose a method to capture the handling abilities of fast jet pilots in a software model via reinforcement learning (RL) from human preference feedback. We use pairwise preferences over simulated flight trajectories to learn an interpretable rule-based model called a reward tree, which enables the automated scoring of trajectories alongside an explanatory rationale. We train an RL agent to execute high-quality handling behaviour by using the reward tree as the objective, and thereby generate data for iterative preference collection and further refinement of both tree and agent. Experiments with synthetic preferences show reward trees to be competitive with uninterpretable neural network reward models on quantitative and qualitative evaluations.
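As a rough illustration of how a reward tree might score a trajectory and report its rationale, consider the minimal sketch below. It is not the implementation used in this work. The feature names (`altitude_error`, `roll_rate`), the thresholds, the leaf rewards, and the choice to sum per-step rewards over a trajectory are all assumptions made for the example, and the step that learns the tree from pairwise preferences is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class Node:
    """Internal node tests `feature < threshold`; a leaf holds a reward."""
    feature: Optional[str] = None   # None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None   # branch taken when value < threshold
    right: Optional["Node"] = None  # branch taken otherwise
    reward: float = 0.0


def step_reward(node: Node, state: Dict[str, float]) -> Tuple[float, List[str]]:
    """Return the leaf reward for one state plus the rules that led to it."""
    rationale = []
    while node.feature is not None:
        if state[node.feature] < node.threshold:
            rationale.append(f"{node.feature} < {node.threshold}")
            node = node.left
        else:
            rationale.append(f"{node.feature} >= {node.threshold}")
            node = node.right
    return node.reward, rationale


def trajectory_return(tree: Node, trajectory: Sequence[Dict[str, float]]) -> float:
    """Assumed scoring rule: sum of per-step rewards along the trajectory."""
    return sum(step_reward(tree, state)[0] for state in trajectory)


def prefers(tree: Node, traj_a, traj_b) -> bool:
    """True if the tree scores traj_a strictly higher than traj_b."""
    return trajectory_return(tree, traj_a) > trajectory_return(tree, traj_b)


# Hypothetical hand-written tree; in the proposed method it would be learned.
tree = Node(
    feature="altitude_error", threshold=50.0,
    left=Node(
        feature="roll_rate", threshold=0.5,
        left=Node(reward=1.0),
        right=Node(reward=0.2),
    ),
    right=Node(reward=-1.0),
)

smooth = [
    {"altitude_error": 10.0, "roll_rate": 0.1},
    {"altitude_error": 20.0, "roll_rate": 0.3},
]
erratic = [
    {"altitude_error": 80.0, "roll_rate": 0.9},
    {"altitude_error": 30.0, "roll_rate": 0.7},
]

reward, rationale = step_reward(tree, smooth[0])
print(reward, rationale)
print(prefers(tree, smooth, erratic))
```

Because every score traces back to an explicit chain of threshold tests, the rationale list is the kind of explanation an uninterpretable neural network reward model cannot provide directly.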