In Reinforcement Learning (RL), it is commonly assumed that an immediate reward signal is generated for each action taken by the agent, which helps the agent maximize cumulative reward and obtain the optimal policy. However, in many real-world scenarios, immediate reward signals are not obtainable; instead, agents receive a single reward that is contingent upon a partial sequence or a complete trajectory. In this work, we define this challenging problem as Reinforcement Learning from Bagged Reward (RLBR), where sequences of data are treated as bags with non-Markovian bagged rewards. We provide a theoretical study to establish the connection between RLBR and standard RL in Markov Decision Processes (MDPs). To effectively explore the reward distributions within these bags and enhance policy training, we propose a Transformer-based reward model, the Reward Bag Transformer, which employs a bidirectional attention mechanism to interpret contextual nuances and temporal dependencies within each bag. Our empirical evaluations reveal that the problem becomes harder as the bag length increases, with performance degrading because of reduced informational granularity. Nevertheless, our approach consistently outperforms existing methods, showing the smallest decline in performance across varying bag lengths and closely approximating the original MDP's reward distribution.
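To make the bagged-reward setting concrete, the following minimal sketch shows how per-step rewards, which the agent never observes, might collapse into one reward per bag. It assumes that a bagged reward is the sum of the step rewards inside the bag and that bags are consecutive and of fixed length; both are illustrative assumptions, and the names `bag_rewards` and `uniform_redistribution` are hypothetical. The uniform split is only a naive baseline for comparison, not the Reward Bag Transformer.

```python
from typing import List, Tuple

# A bag is (start index, end index exclusive, bagged reward).
Bag = Tuple[int, int, float]


def bag_rewards(step_rewards: List[float], bag_length: int) -> List[Bag]:
    """Partition a trajectory into consecutive bags and sum rewards per bag.

    Assumption: the bagged reward is the sum of the hidden step rewards.
    """
    if bag_length <= 0:
        raise ValueError("bag_length must be positive")
    bags: List[Bag] = []
    for start in range(0, len(step_rewards), bag_length):
        end = min(start + bag_length, len(step_rewards))
        bags.append((start, end, sum(step_rewards[start:end])))
    return bags


def uniform_redistribution(bags: List[Bag]) -> List[float]:
    """Naive baseline: spread each bagged reward evenly over its steps."""
    per_step: List[float] = []
    for start, end, total in bags:
        size = end - start
        per_step.extend([total / size] * size)
    return per_step


if __name__ == "__main__":
    hidden = [1.0, 0.0, 2.0, 1.0, 3.0]  # step rewards the agent cannot see
    bags = bag_rewards(hidden, bag_length=2)
    print(bags)
    print(uniform_redistribution(bags))
```

Under these assumptions, longer bags leave fewer observed reward values per trajectory, which is one way to read the loss of informational granularity described above. A full trajectory treated as a single bag yields only one number.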