Local prediction-error-based curiosity rewards focus on the current transition without considering the world model's cumulative prediction error across all visited transitions. We introduce Curiosity-Critic, which grounds its intrinsic reward in the improvement of this cumulative objective, and show that it admits a tractable per-step surrogate: the difference between the current prediction error and the asymptotic error baseline of the current state transition. We estimate this error baseline online with a learned critic co-trained alongside the world model; regressing a single scalar, the critic converges well before the world model saturates, redirecting exploration toward learnable transitions without oracle knowledge of the noise floor. The reward is higher for learnable transitions and collapses toward the error baseline for stochastic ones, effectively separating epistemic (reducible) from aleatoric (irreducible) prediction error online. Prior prediction-error curiosity formulations, from Schmidhuber (1991) to learned-feature-space variants, emerge as special cases corresponding to specific approximations of this error baseline. Experiments on a stochastic grid world show that Curiosity-Critic outperforms prediction-error, visitation-count, and Random Network Distillation methods in training speed and final world model accuracy.
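As an illustration of the per-step surrogate described above, the following minimal sketch computes the reward as the current prediction error minus an estimate of the asymptotic error baseline. It is a sketch under stated assumptions, not the paper's implementation: the function names, the squared-error metric, and the numeric values are all hypothetical, and the baseline is passed in as a number rather than produced by a trained critic.

```python
def prediction_error(predicted_next, actual_next):
    """Squared error between a predicted and an observed next state.

    Assumption: squared error is used only for illustration; the abstract
    does not specify the error metric.
    """
    return float(sum((p - a) ** 2 for p, a in zip(predicted_next, actual_next)))


def curiosity_critic_reward(current_error, baseline_estimate):
    """Per-step surrogate reward: current error minus the estimated asymptotic error.

    `baseline_estimate` stands in for the critic's output for this transition
    (hypothetical interface; in the paper the critic is co-trained online).
    """
    return current_error - baseline_estimate


# Hypothetical values: both transitions currently have the same prediction error,
# but they differ in how much of that error is reducible.
learnable = curiosity_critic_reward(current_error=0.80, baseline_estimate=0.05)
stochastic = curiosity_critic_reward(current_error=0.80, baseline_estimate=0.75)

print(f"learnable transition reward:  {learnable:.2f}")
print(f"stochastic transition reward: {stochastic:.2f}")
```

With these illustrative numbers, a plain prediction-error reward would score both transitions identically at 0.80, whereas subtracting the baseline leaves a large reward only where the error is reducible.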