When there exists an unlearnable source of randomness (a noisy-TV) in the environment, an agent driven naively by intrinsic reward gets stuck at that source of randomness and fails to explore. Intrinsic rewards based on uncertainty estimation or distribution similarity, while eventually escaping noisy-TVs as time unfolds, suffer from poor sample efficiency and high computational cost. Inspired by recent findings from neuroscience that humans monitor their own improvement during exploration, we propose a novel method for intrinsically motivated exploration, named Learning Progress Monitoring (LPM). During exploration, LPM rewards model improvement instead of prediction error or novelty, effectively rewarding the agent for observing learnable transitions rather than unlearnable ones. We introduce a dual-network design that uses an error model to predict the expected prediction error of the dynamics model at its previous iteration, and uses the difference between the model errors of the current and previous iterations to guide exploration. We show theoretically that the intrinsic reward of LPM is zero-equivariant and a monotone indicator of Information Gain (IG), and that the error model is necessary to achieve monotone correspondence with IG. We empirically compare LPM against state-of-the-art baselines in noisy environments based on MNIST, a 3D maze with 160×120 RGB inputs, and Atari. Results show that LPM's intrinsic reward converges faster, explores more states in the maze experiment, and achieves higher extrinsic reward in Atari. This conceptually simple approach marks a paradigm shift in noise-robust exploration. Code to reproduce our experiments is available at https://github.com/Akuna23Matata/LPM_exploration
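To make the dual-network design concrete, the following is a minimal sketch of how the intrinsic reward could be computed: an error model predicts the dynamics model's prediction error from the previous iteration, and the reward is the predicted previous error minus the current error. The class name `LPM`, the MLP architectures, the MSE error measure, and the update ordering are illustrative assumptions, not the authors' implementation; see the repository linked above for the actual code.

```python
# Minimal sketch of the learning-progress intrinsic reward (illustrative assumptions:
# MSE prediction error, simple MLPs, per-batch updates). Not the authors' implementation.
import torch
import torch.nn as nn


class MLP(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)


class LPM:
    """Dual-network learning-progress reward (hypothetical sketch)."""

    def __init__(self, obs_dim, act_dim, lr=1e-3):
        self.dynamics = MLP(obs_dim + act_dim, obs_dim)   # predicts next observation
        self.error_model = MLP(obs_dim + act_dim, 1)      # predicts previous-iteration error
        self.opt_dyn = torch.optim.Adam(self.dynamics.parameters(), lr=lr)
        self.opt_err = torch.optim.Adam(self.error_model.parameters(), lr=lr)

    def intrinsic_reward(self, obs, act, next_obs):
        with torch.no_grad():
            x = torch.cat([obs, act], dim=-1)
            cur_err = ((self.dynamics(x) - next_obs) ** 2).mean(dim=-1, keepdim=True)
            prev_err = self.error_model(x)
        # Reward = predicted error of the previous model minus the current error:
        # positive when the dynamics model has improved (learnable transition),
        # near zero for unlearnable noise, where the error cannot shrink.
        return (prev_err - cur_err).squeeze(-1)

    def update(self, obs, act, next_obs):
        x = torch.cat([obs, act], dim=-1)
        # 1) Fit the error model to the *current* dynamics error; once the dynamics
        #    model is updated below, this becomes the "previous iteration" error.
        with torch.no_grad():
            target_err = ((self.dynamics(x) - next_obs) ** 2).mean(dim=-1, keepdim=True)
        err_loss = ((self.error_model(x) - target_err) ** 2).mean()
        self.opt_err.zero_grad(); err_loss.backward(); self.opt_err.step()
        # 2) Update the dynamics model on the same batch of transitions.
        dyn_loss = ((self.dynamics(x) - next_obs) ** 2).mean()
        self.opt_dyn.zero_grad(); dyn_loss.backward(); self.opt_dyn.step()
```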