This report presents a solution to the swing-up and stabilisation tasks for the acrobot and pendubot, developed for the AI Olympics competition at IROS 2024. Our approach employs Average-Reward Entropy Advantage Policy Optimization (AR-EAPO), a model-free reinforcement learning (RL) algorithm that combines average-reward RL with maximum entropy RL. Results demonstrate that our controller achieves improved performance and robustness scores over established baseline methods in both the acrobot and pendubot scenarios, without requiring a heavily engineered reward function or a system model. The current results apply exclusively to the simulation stage of the competition.