An important question in AI is the extent to which successful behaviour requires an internal representation of the world. In this work, we quantify the amount of information an optimal policy provides about the underlying environment. We consider a Controlled Markov Process (CMP) with $n$ states and $m$ actions, assuming a uniform prior over the space of possible transition dynamics. We prove that observing a deterministic policy that is optimal for any non-constant reward function conveys exactly $n \log m$ bits of information about the environment: the mutual information between the environment and the optimal policy is $n \log m$ bits. This result holds across a broad class of objectives, including finite-horizon, infinite-horizon discounted, and time-averaged reward maximization. These findings provide a precise information-theoretic lower bound on the ``implicit world model'' necessary for optimality.
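As a rough sketch of the counting behind this figure (our illustration, not the paper's proof): a deterministic policy on a CMP with $n$ states and $m$ actions is one of $m^n$ maps from states to actions. If, under the uniform prior over dynamics, the induced optimal policy $\pi^*$ is a deterministic function of the environment $E$ and is uniformly distributed over these $m^n$ candidates (both assumptions of this sketch), the standard decomposition of mutual information yields the stated value:
% Illustrative sketch only; assumes a unique optimal policy and a uniform marginal over policies.
\begin{align*}
  I(E; \pi^*) &= H(\pi^*) - H(\pi^* \mid E) \\
              &= H(\pi^*) && \text{($\pi^*$ is a deterministic function of $E$, so $H(\pi^* \mid E) = 0$)} \\
              &= \log m^n = n \log m && \text{($\pi^*$ uniform over the $m^n$ deterministic policies).}
\end{align*}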