A future sequence represents the outcome of executing an action in the environment, i.e., the trajectory from that point onward. When action selection is driven by the information-theoretic concept of mutual information, it seeks maximally informative consequences. The outcome of interest may be a state, a return, or a trajectory, each serving a different purpose such as credit assignment or imitation learning. However, the inherent nature of combining intrinsic motivation with reward maximization is often neglected. In this work, we propose a policy iteration scheme that incorporates mutual information while ensuring convergence to the optimal policy. We also introduce a variational approach that jointly learns the dynamics model and the quantity needed to estimate the mutual information, providing a general framework for incorporating different forms of outcomes of interest. While we mainly focus on theoretical analysis, our approach opens the possibility of combining intrinsic control with model learning to enhance sample efficiency and to incorporate environment uncertainty into decision-making.