Behavior-Constrained Reinforcement Learning with Receding-Horizon Credit Assignment for High-Performance Control

Learning high-performance control policies that remain consistent with expert behavior is a fundamental challenge in robotics. Reinforcement learning can discover high-performing strategies but often departs from desirable human behavior, whereas imitation learning is limited by demonstration quality and struggles to improve beyond the expert data. We propose a behavior-constrained reinforcement learning framework that improves beyond demonstrations while explicitly controlling deviation from expert behavior. Because expert-consistent behavior in dynamic control is inherently trajectory-level, we introduce a receding-horizon predictive mechanism that models short-term future trajectories and provides look-ahead rewards during training. To account for the natural variability of human behavior under disturbances and changing conditions, we further condition the policy on reference trajectories, allowing it to represent a distribution of expert-consistent behaviors rather than a single deterministic target. We evaluate the approach in a high-fidelity race car simulation using data from professional drivers, a domain characterized by extreme dynamics and narrow performance margins. The learned policies achieve competitive lap times while remaining closely aligned with expert driving behavior, outperforming baseline methods in both performance and imitation quality. Beyond standard benchmarks, we conduct a human-grounded evaluation in a driver-in-the-loop simulator and show that the learned policies reproduce setup-dependent driving characteristics consistent with the feedback of top-class professional race drivers. These results demonstrate that our method learns control policies that are both high-performing and behavior-consistent, and that such policies can serve as reliable surrogates for human decision-making in complex control systems.
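The abstract does not state how the look-ahead reward is computed, so the following is only a minimal sketch of one way a receding-horizon reward could combine a task reward with a trajectory-level deviation penalty. The function names, the squared-error distance, the per-step discount, and the weight beta are all illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence

# A state is a flat feature vector (e.g. position, speed); this layout is hypothetical.
State = Sequence[float]


def lookahead_penalty(
    predicted: Sequence[State],
    reference: Sequence[State],
    discount: float = 0.9,
) -> float:
    """Discount-weighted mean squared deviation over a short horizon.

    `predicted` is the short-term future trajectory produced by some
    predictive model; `reference` is the expert trajectory segment over
    the same horizon. Both are assumed to have equal length.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference horizons must match")

    total = 0.0
    weight_sum = 0.0
    for k, (p, r) in enumerate(zip(predicted, reference)):
        weight = discount ** k  # nearer steps count more (assumed choice)
        sq_dist = sum((pi - ri) ** 2 for pi, ri in zip(p, r))
        total += weight * sq_dist
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def shaped_reward(
    task_reward: float,
    predicted: Sequence[State],
    reference: Sequence[State],
    beta: float = 1.0,
    discount: float = 0.9,
) -> float:
    """Task reward minus a weighted look-ahead deviation penalty.

    `beta` trades off performance against expert consistency; larger
    values keep the policy closer to the reference trajectory.
    """
    return task_reward - beta * lookahead_penalty(predicted, reference, discount)
```

In this sketch the horizon is re-evaluated at every control step, so the penalty always covers the next few predicted states rather than the whole episode. Setting beta to zero recovers the plain task reward, while a large beta pushes the policy toward pure trajectory tracking.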
