A basic assumption of traditional reinforcement learning is that the value of a reward does not change once it is received by an agent. The present work forgoes this assumption and considers the situation where the value of a reward decays proportionally to the time elapsed since it was obtained. Emphasizing the inflection point occurring at the time of payment, we use the term asset to refer to a reward that is currently in the possession of an agent. Adopting this language, we initiate the study of depreciating assets within the framework of infinite-horizon quantitative optimization. In particular, we propose a notion of asset depreciation, inspired by classical exponential discounting, where the value of an asset is scaled by a fixed discount factor at each time step after it is obtained by the agent. We formulate a Bellman-style equational characterization of optimality in this context and develop a model-free reinforcement learning approach to obtain optimal policies.
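The depreciation rule described above can be illustrated with a minimal sketch. It is not the authors' formulation or algorithm. It assumes a finite reward sequence, and it assumes that an asset counts at full value on the step it is received and is scaled by the discount factor on every later step. The names `depreciated_holdings`, `rewards`, and `gamma` are hypothetical and chosen for illustration only.

```python
def depreciated_holdings(rewards, gamma):
    """Track the total value of the assets an agent holds after each step.

    Assumption (illustrative, not from the source): an asset received at
    step t is worth r_t at step t and gamma**k * r_t after k further steps.
    """
    holdings = 0.0
    history = []
    for r in rewards:
        # Everything already held depreciates by one step;
        # the newly received reward enters at full value.
        holdings = gamma * holdings + r
        history.append(holdings)
    return history


def depreciated_holdings_direct(rewards, gamma):
    """Same quantity at the final step, written as an explicit sum."""
    n = len(rewards) - 1
    return sum(gamma ** (n - t) * r for t, r in enumerate(rewards))


if __name__ == "__main__":
    # A single unit reward loses half its value at every later step.
    print(depreciated_holdings([1.0, 0.0, 0.0], gamma=0.5))
```

The recurrence and the explicit sum agree on the final step, which shows the contrast with classical discounting: there the weight on a reward depends on how far in the future it is received, whereas here it depends on how long the agent has already held it.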