Reinforcement learning (RL) can be highly effective at learning goal-reaching policies, but it typically provides no formal guarantee that the goal will always be reached. A common way to obtain such guarantees is to introduce a shielding mechanism that restricts the agent to actions satisfying predefined safety constraints. The main challenge is integrating this mechanism with RL so that learning and exploration remain effective without becoming overly conservative. This paper therefore proposes an RL-based control framework that provides formal goal-reaching guarantees for wheeled mobile robots operating in unstructured environments. We first design a real-time RL policy with a set of 15 carefully defined reward terms. These rewards encourage the robot to reach both static and dynamic goals while generating sufficiently smooth command signals that comply with predefined safety specifications, which is critical in practice. Second, a Lyapunov-like stabilizer layer is integrated into the benchmark RL framework as a policy supervisor, formally strengthening goal-reaching control while preserving meaningful exploration of the state-action space. The proposed framework is suitable for real-time deployment in challenging environments: it formally guarantees convergence to the intended goal states and compensates for uncertainties by generating real-time control signals from the current state, while respecting real-world motion constraints. Experimental results show that the proposed Lyapunov-like stabilizer consistently improves the benchmark RL policies, boosting the goal-reaching rate from 84.6% to 99.0%, sharply reducing failures, and improving efficiency.
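The supervisor idea described above can be illustrated with a minimal sketch: an RL policy proposes an action, and a Lyapunov-like filter accepts it only if a candidate Lyapunov function decreases under one simulated step, falling back to a simple stabilizing controller otherwise. The point-mass dynamics, the candidate function V(x) = ||x - goal||^2, and all function names here are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

# Hypothetical Lyapunov-like "shield" over an RL policy (illustrative
# assumptions throughout; not the paper's actual stabilizer design).

def step(x, u, dt=0.1):
    """Simulate one step of assumed point-mass dynamics: x' = x + u*dt."""
    return x + u * dt

def lyapunov(x, goal):
    """Candidate Lyapunov function: squared distance to the goal."""
    return float(np.sum((x - goal) ** 2))

def fallback(x, goal, gain=1.0, u_max=1.0):
    """Simple proportional controller toward the goal, clipped to u_max."""
    u = gain * (goal - x)
    n = np.linalg.norm(u)
    return u if n <= u_max else u * (u_max / n)

def shielded_action(x, goal, rl_action, margin=1e-3):
    """Accept the RL action only if V strictly decreases; else fall back."""
    if lyapunov(step(x, rl_action), goal) <= lyapunov(x, goal) - margin:
        return rl_action
    return fallback(x, goal)

# Rollout: even with a random "policy", V is non-increasing step by step,
# because every accepted action decreases V and the fallback is stabilizing.
rng = np.random.default_rng(0)
x, goal = np.array([2.0, -1.0]), np.zeros(2)
for _ in range(200):
    u_rl = rng.uniform(-1.0, 1.0, size=2)  # stand-in for a learned policy
    x = step(x, shielded_action(x, goal, u_rl))
```

The key design choice, as in shielding approaches generally, is that exploration is left untouched whenever the proposed action already makes progress; the supervisor intervenes only when the candidate Lyapunov decrease condition would be violated.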