We evaluate benchmark deep reinforcement learning (DRL) algorithms on portfolio optimisation in a simulated market. The simulator is based on correlated geometric Brownian motion (GBM) with the Bertsimas-Lo (BL) market impact model. Using the Kelly criterion (log utility) as the objective, we analytically derive the optimal policy without market impact and use it as an upper bound on performance when market impact is included. We find that the off-policy algorithms DDPG, TD3 and SAC are unable to learn the correct Q-function because of the noisy rewards, and therefore perform poorly. The on-policy algorithms PPO and A2C, using generalised advantage estimation (GAE), cope with the noise and learn a close-to-optimal policy. The clipping variant of PPO proves important in preventing the policy from drifting away from the optimum once it has converged. In a more challenging environment with regime changes in the GBM parameters, PPO combined with a hidden Markov model (HMM) that learns and predicts the regime context is able to learn different policies adapted to each regime. Overall, we find that the sample complexity of these algorithms is too high: learning a good policy in the simplest setting requires more than 2 million steps, which is equivalent to almost 8,000 years of daily prices.
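To make the analytical benchmark and the sample-complexity figure concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the textbook unconstrained Kelly solution for correlated GBM without market impact, w* = Σ⁻¹(μ − r), where μ is the vector of arithmetic drifts, Σ the instantaneous covariance matrix, and r a risk-free rate. The abstract does not say whether a risk-free asset or weight constraints are used, so `r`, the numerical parameters, and the function names `kelly_weights`, `growth_rate` and `steps_to_years` are all illustrative assumptions. The conversion of steps to years assumes one step per trading day and 252 trading days per year.

```python
import numpy as np

# Assumption: one environment step is one trading day, 252 trading days per year.
TRADING_DAYS_PER_YEAR = 252


def kelly_weights(mu, sigma, r=0.0):
    """Unconstrained growth-optimal (Kelly) weights for correlated GBM.

    Assumes no market impact and continuous rebalancing. Solves
    sigma @ w = mu - r instead of forming an explicit inverse.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return np.linalg.solve(sigma, mu - r)


def growth_rate(w, mu, sigma, r=0.0):
    """Expected log-wealth growth rate: r + w'(mu - r) - 0.5 * w' sigma w."""
    w = np.asarray(w, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return r + w @ (mu - r) - 0.5 * (w @ sigma @ w)


def steps_to_years(steps, steps_per_year=TRADING_DAYS_PER_YEAR):
    """Convert a number of daily environment steps into years of price data."""
    return steps / steps_per_year


# Hypothetical two-asset market (values chosen for illustration only).
mu = np.array([0.08, 0.05])          # annualised arithmetic drifts
vol = np.array([0.20, 0.10])         # annualised volatilities
rho = 0.3                            # correlation between the two assets
corr = np.array([[1.0, rho], [rho, 1.0]])
sigma = np.outer(vol, vol) * corr    # covariance matrix
r = 0.01                             # assumed risk-free rate

w_star = kelly_weights(mu, sigma, r)
g_star = growth_rate(w_star, mu, sigma, r)

if __name__ == "__main__":
    print("Kelly weights:", w_star)
    print("Optimal growth rate:", g_star)
    print("Years of daily data in 2 million steps:", steps_to_years(2_000_000))
```

Under these assumptions, `growth_rate(w_star, ...)` is the kind of impact-free upper bound against which a learned policy's realised log-growth could be compared, and `steps_to_years(2_000_000)` reproduces the order of magnitude quoted in the abstract.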