Reward machines inform reinforcement learning agents about the reward structure of the environment and often drastically speed up the learning process. However, reward machines only accept Boolean features such as robot-reached-gold. Consequently, many inherently numeric tasks cannot benefit from the guidance offered by reward machines. To address this gap, we aim to extend reward machines with numeric features such as distance-to-gold. For this, we present two types of reward machines: numeric-Boolean and numeric. In a numeric-Boolean reward machine, distance-to-gold is emulated by two Boolean features, distance-to-gold-decreased and robot-reached-gold. In a numeric reward machine, distance-to-gold is used directly alongside the Boolean feature robot-reached-gold. We compare our new approaches to a baseline reward machine in the Craft domain, where the numeric feature is the agent-to-target distance. We use cross-product Q-learning, Q-learning with counterfactual experiences, and the options framework for learning. Our experimental results show that our new approaches significantly outperform the baseline approach. Extending reward machines with numeric features opens up new possibilities for using reward machines in inherently numeric tasks.
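To make the distinction between the three reward-machine variants concrete, the following is a minimal sketch in Python. The class names, the state labels (`u0`, `u_acc`), and the shaping coefficient `0.1` are illustrative assumptions, not the paper's implementation; each machine maps a labeled observation to a reward and updates its internal state.

```python
class BooleanRewardMachine:
    """Baseline RM: only the Boolean event reached_gold is observable."""
    def __init__(self):
        self.state = "u0"

    def step(self, reached_gold: bool) -> float:
        if self.state == "u0" and reached_gold:
            self.state = "u_acc"  # accepting state: task complete
            return 1.0
        return 0.0


class NumericBooleanRewardMachine:
    """Emulates distance-to-gold via two Boolean features:
    distance_decreased and reached_gold (coefficient 0.1 is an assumption)."""
    def __init__(self):
        self.state = "u0"

    def step(self, distance_decreased: bool, reached_gold: bool) -> float:
        if reached_gold:
            self.state = "u_acc"
            return 1.0
        # small shaping reward whenever the agent moves closer
        return 0.1 if distance_decreased else 0.0


class NumericRewardMachine:
    """Uses the numeric feature distance_to_gold directly,
    rewarding the measured reduction in distance per step."""
    def __init__(self, initial_distance: float):
        self.state = "u0"
        self.prev_distance = initial_distance

    def step(self, distance_to_gold: float, reached_gold: bool) -> float:
        if reached_gold:
            self.state = "u_acc"
            return 1.0
        # reward proportional to actual progress toward the gold
        reward = 0.1 * (self.prev_distance - distance_to_gold)
        self.prev_distance = distance_to_gold
        return reward
```

The numeric machine rewards progress in proportion to the distance actually covered, whereas the numeric-Boolean machine only registers *that* the distance decreased, illustrating the trade-off between the two extensions.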