Several recent papers study reinforcement learning (RL) problems in which the learner must infer an unobserved reward from feedback variables. Interaction-Grounded Learning (IGL) is one such feedback-based RL setting: the learner optimizes the return by inferring latent binary rewards from its interaction with the environment. A common assumption in the IGL literature is that the feedback variable $Y$ is conditionally independent of the context-action pair $(X,A)$ given the latent reward $R$. In this work, we propose Variational Information-based IGL (VI-IGL), an information-theoretic method for enforcing this conditional independence assumption in IGL-based RL problems. The VI-IGL framework learns a reward decoder with an objective based on the conditional mutual information (MI) between $(X,A)$ and $Y$. To estimate and optimize the information-based terms for the continuous random variables in the RL problem, VI-IGL uses the variational representation of mutual information, which yields a min-max optimization problem. We also extend VI-IGL to general $f$-information measures, leading to the generalized $f$-VI-IGL framework for IGL-based RL problems. Numerical results on several reinforcement learning settings indicate improved performance over the existing IGL-based RL algorithm.
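For reference, the quantities named in the abstract can be written out as follows. This is a sketch under stated assumptions, not the paper's exact objective: the decoder parameter $\psi$, the decoded reward $R_\psi$, the critic function $T$, and the distributions $P_\psi$ and $Q_\psi$ are notation introduced here for illustration. The variational forms shown are the standard Donsker-Varadhan and $f$-divergence representations, which may differ from the specific ones used in VI-IGL, and the actual objective may contain additional terms.

The conditional independence assumption is equivalent to a vanishing conditional MI:

$$ Y \perp (X,A) \mid R \iff I(Y; X,A \mid R) = 0 . $$

Let $P_\psi$ denote the joint distribution of $(X,A,Y,R_\psi)$ and let $Q_\psi = P_{R_\psi}\, P_{X,A \mid R_\psi}\, P_{Y \mid R_\psi}$ be the corresponding conditionally independent reference. Then $I(Y; X,A \mid R_\psi) = D_{\mathrm{KL}}(P_\psi \,\|\, Q_\psi)$, and the Donsker-Varadhan representation of the KL divergence turns its minimization over the decoder into a min-max problem:

$$ \min_{\psi} \, \sup_{T} \; \mathbb{E}_{P_\psi}\big[T(X,A,Y,R_\psi)\big] - \log \mathbb{E}_{Q_\psi}\big[e^{T(X,A,Y,R_\psi)}\big] . $$

For a general convex $f$ with $f(1)=0$ and convex conjugate $f^*$, the analogous representation of the $f$-divergence is

$$ D_f(P \,\|\, Q) = \sup_{T} \; \mathbb{E}_{P}\big[T\big] - \mathbb{E}_{Q}\big[f^*(T)\big] , $$

which suggests how an $f$-information version of the same min-max problem could be formed by replacing the KL divergence with $D_f$.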