Reinforcement learning (RL) can be formulated as a sequence modeling problem, where models predict future actions based on historical state-action-reward sequences. In offline RL settings, current approaches typically require long trajectory sequences to model the environment. However, these models tend to over-rely on memorizing long-term representations, which impairs their ability to attribute importance to trajectories and learned representations based on task-specific relevance. In this work, we introduce AdaCred, a novel approach that represents trajectories as causal graphs built from short-term action-reward-state sequences. Our model adaptively learns a control policy by crediting and pruning low-importance representations, retaining only those most relevant to the downstream task. Our experiments demonstrate that AdaCred-based policies require shorter trajectory sequences and consistently outperform conventional methods in both offline reinforcement learning and imitation learning environments.