This paper presents S-REINFORCE, a novel reinforcement learning (RL) algorithm designed to generate interpretable policies for dynamic decision-making tasks. The algorithm combines two function approximators, a Neural Network (NN) and a Symbolic Regressor (SR), which produce numerical and symbolic policies, respectively. The NN learns a numerical probability distribution over the available actions via policy gradient, while the SR recovers the functional form that relates states to the corresponding action probabilities. The SR-generated policy expressions are then incorporated through importance sampling to improve the rewards obtained during learning. We evaluate S-REINFORCE on several dynamic decision-making problems with low- and high-dimensional action spaces, and the results demonstrate its effectiveness in producing interpretable solutions. By combining the strengths of NN and SR, S-REINFORCE yields policies that both perform well and are easy to interpret, making it well suited to real-world applications where transparency and causality are crucial.
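To make the interplay between the two components concrete, here is a minimal NumPy sketch of an importance-weighted REINFORCE update. It is not the authors' S-REINFORCE implementation and rests on several assumptions the abstract does not state: a linear-softmax model stands in for the NN, `sr_policy` is a hypothetical closed-form expression of the kind an SR might recover, episodes are sampled from the SR policy and reweighted by the ordinary (whole-trajectory) importance ratio, and the toy task in `rollout` is invented for illustration.

```python
import numpy as np


def softmax(z):
    z = z - np.max(z)  # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()


def nn_policy(theta, s):
    """Stand-in for the NN (assumption): linear logits theta @ s, softmax over actions."""
    return softmax(theta @ s)


def sr_policy(s):
    """Hypothetical SR-recovered expression for two actions:
    p(a=1 | s) = sigmoid(2*s[0] - s[1])."""
    p1 = 1.0 / (1.0 + np.exp(-(2.0 * s[0] - s[1])))
    return np.array([1.0 - p1, p1])


def grad_log_pi(theta, s, a):
    """Gradient of log nn_policy(theta, s)[a] w.r.t. theta: (onehot(a) - pi) outer s."""
    pi = nn_policy(theta, s)
    onehot = np.zeros_like(pi)
    onehot[a] = 1.0
    return np.outer(onehot - pi, s)


def returns_to_go(rewards, gamma):
    """Discounted return G_t = r_t + gamma * G_{t+1} for every step t."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]


def is_reinforce_grad(theta, episode, behavior, gamma=0.99):
    """REINFORCE gradient for nn_policy from an episode of (s, a, r) tuples
    sampled under `behavior`, scaled by the trajectory importance ratio
    W = prod_t nn_policy(a_t | s_t) / behavior(a_t | s_t)."""
    states, actions, rewards = zip(*episode)
    W = 1.0
    for s, a in zip(states, actions):
        W *= nn_policy(theta, s)[a] / behavior(s)[a]
    grad = np.zeros_like(theta)
    for s, a, G in zip(states, actions, returns_to_go(rewards, gamma)):
        grad += G * grad_log_pi(theta, s, a)
    return W * grad, W


def rollout(policy, rng, horizon=5):
    """Hypothetical toy task: random 2-D states; reward 1 when the chosen
    action equals int(s[0] > 0), else reward 0."""
    episode = []
    for _ in range(horizon):
        s = rng.normal(size=2)
        a = int(rng.choice(2, p=policy(s)))
        r = 1.0 if a == int(s[0] > 0) else 0.0
        episode.append((s, a, r))
    return episode


def train(steps=200, lr=0.05, seed=0):
    """Sample episodes from the SR policy and update the NN policy off-policy."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((2, 2))  # shape (n_actions, state_dim)
    for _ in range(steps):
        episode = rollout(sr_policy, rng)
        g, _ = is_reinforce_grad(theta, episode, sr_policy)
        theta += lr * g  # gradient ascent on expected return
    return theta
```

Sampling from the SR policy and correcting with `W` lets the NN learn from the symbolic policy's experience while remaining an unbiased policy-gradient estimate. Because the whole-trajectory ratio has high variance, per-decision or clipped importance weights are common alternatives in practice.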