Assigning credit to actions in proportion to their contribution to future outcomes is a long-standing open challenge in Reinforcement Learning. The most commonly used credit assignment method relies on assumptions that are ill-suited to tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that the agent has actually selected, which makes it highly inefficient. Still, no alternative method has been widely adopted in the field. Hindsight Credit Assignment is a promising but still largely unexplored candidate that aims to address both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits and the key aspects that need improvement. We then apply it to factored state representations, in particular to representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly reduces the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.
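To make the counterfactual aspect concrete, here is a minimal sketch of the state-conditional estimator from the original Hindsight Credit Assignment paper (Harutyunyan et al., 2019): Q(x, a) = r(x, a) + E[ sum over k >= 1 of gamma^k * (h_k(a | x, X_k) / pi(a | x)) * R_k ], where h_k(a | x, X_k) is the probability that action a was taken in x given that state X_k was reached k steps later. Because every action is scored against the same sampled trajectory, a single rollout yields an estimate for all actions, not only the one the agent selected. This is an illustration under stated assumptions, not the implementation used in this thesis: the policy `pi`, the hindsight distribution `h` and the immediate reward `r` are hypothetical callables assumed to be given, whereas in practice `h` has to be learned.

```python
from typing import Callable, Hashable, List, Sequence

State = Hashable


def hca_q_estimates(
    x0: State,
    future_states: Sequence[State],    # X_1, X_2, ... visited after x0
    future_rewards: Sequence[float],   # R_1, R_2, ... received at those steps
    num_actions: int,
    pi: Callable[[int, State], float],              # hypothetical: pi(a | x), assumed > 0
    h: Callable[[int, State, State, int], float],   # hypothetical: h_k(a | x, x_k)
    r: Callable[[State, int], float],               # hypothetical: immediate reward r(x, a)
    gamma: float,
) -> List[float]:
    """Single-trajectory state-conditional HCA estimate of Q(x0, a) for every action a."""
    estimates = []
    for a in range(num_actions):
        # The immediate reward is assumed known for every action, not only the taken one.
        q = r(x0, a)
        for k, (xk, rk) in enumerate(zip(future_states, future_rewards), start=1):
            # Reweight the reward at step k by how likely a is, in hindsight, given x_k.
            q += gamma**k * h(a, x0, xk, k) / pi(a, x0) * rk
        estimates.append(q)
    return estimates


if __name__ == "__main__":
    uniform = lambda a, x: 0.5
    reward = lambda x, a: 1.0 if a == 0 else 0.0
    # Uninformative hindsight (h equals pi): future rewards are credited equally.
    print(hca_q_estimates("x0", ["s1", "s2"], [1.0, 1.0], 2, uniform,
                          lambda a, x, xk, k: 0.5, reward, gamma=0.5))  # prints [1.75, 0.75]
    # Fully informative hindsight: s1 and s2 are only reachable via action 0.
    print(hca_q_estimates("x0", ["s1", "s2"], [1.0, 1.0], 2, uniform,
                          lambda a, x, xk, k: 1.0 if a == 0 else 0.0, reward, gamma=0.5))  # prints [2.5, 0.0]
```

With an uninformative hindsight distribution the estimator adds the same discounted return to every action, whereas an informative one concentrates the future reward on the action that made the observed states reachable.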