Markov decision processes (MDPs) are used to model a wide variety of applications, ranging from game playing and robotics to finance. Their optimal policy typically maximizes the expected sum of rewards given at each step of the decision process. However, a large class of problems does not fit straightforwardly into this framework: non-cumulative Markov decision processes (NCMDPs), where instead of the expected sum of rewards, the expected value of an arbitrary function of the rewards is maximized. Example functions include the maximum of the rewards or their mean divided by their standard deviation. In this work, we introduce a general mapping of NCMDPs to standard MDPs. This allows all techniques developed to find optimal policies for MDPs, such as reinforcement learning or dynamic programming, to be applied directly to the larger class of NCMDPs. Focusing on reinforcement learning, we show applications in a diverse set of tasks, including classical control, portfolio optimization in finance, and discrete optimization problems. Using our approach, we improve both final performance and training time compared to relying on standard MDPs.
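To give a feel for how such a mapping can work, here is a minimal sketch for one of the example objectives mentioned above, the maximum of the rewards. It is an illustration under our own assumptions, not the paper's implementation: one augments the state with the running maximum and emits shaped per-step rewards whose cumulative sum telescopes to the maximum of the original rewards, so a standard (cumulative) MDP solver optimizes the non-cumulative objective.

```python
def shaped_rewards(rewards):
    """Map a reward sequence to shaped rewards whose sum equals max(rewards).

    At each step, the shaped reward is the increase of the running maximum:
    r'_1 = r_1 and r'_t = max(m_{t-1}, r_t) - m_{t-1} for t > 1, where
    m_{t-1} is the running maximum of the first t-1 rewards. The sum of the
    shaped rewards telescopes to max(r_1, ..., r_T).
    """
    shaped = []
    running_max = None  # plays the role of the augmented state variable m_t
    for r in rewards:
        if running_max is None:
            shaped.append(r)          # first step: no previous maximum
            running_max = r
        else:
            new_max = max(running_max, r)
            shaped.append(new_max - running_max)  # increment of the running max
            running_max = new_max
    return shaped


rewards = [1.0, 3.0, 2.0, 5.0, 4.0]
print(shaped_rewards(rewards))        # shaped per-step rewards
print(sum(shaped_rewards(rewards)))   # equals max(rewards) = 5.0
```

In an actual environment, the running maximum would be carried as an extra component of the state so that the shaped reward remains a function of state, action, and next state; the agent's return under the shaped rewards then equals the non-cumulative objective.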