We propose Strategy-aware Surprise (SuS), an intrinsic motivation framework that uses pre-post prediction mismatch as a novelty signal for exploration in reinforcement learning. Unlike traditional curiosity-driven methods that rely solely on state prediction error, SuS introduces two complementary components: Strategy Stability (SS) and Strategy Surprise (SuS). SS measures the consistency of the agent's behavioral strategy across consecutive time steps, while SuS captures outcomes that are unexpected relative to the agent's current strategy representation. Our reward formulation combines both signals through learned weighting coefficients. We evaluate SuS on mathematical reasoning tasks using large language models, demonstrating significant improvements in both accuracy and solution diversity. Ablation studies confirm that removing either component degrades performance by at least 10%, validating the synergistic nature of our approach. SuS achieves a 17.4% improvement in Pass@1 and a 26.4% improvement in Pass@5 over baseline methods, while maintaining higher strategy diversity throughout training.
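The combined reward described above can be sketched as follows. The abstract does not specify the functional forms, so the cosine-similarity stability measure, squared-error surprise, and linear weighting with coefficients `w_ss` and `w_sus` are all illustrative assumptions, not the paper's actual definitions.

```python
import math

def strategy_stability(prev_strategy, curr_strategy):
    """SS sketch: cosine similarity between consecutive strategy
    representations (assumed form; higher means more consistent)."""
    dot = sum(a * b for a, b in zip(prev_strategy, curr_strategy))
    norm_prev = math.sqrt(sum(a * a for a in prev_strategy))
    norm_curr = math.sqrt(sum(b * b for b in curr_strategy))
    return dot / (norm_prev * norm_curr)

def strategy_surprise(predicted_outcome, actual_outcome):
    """SuS sketch: squared error between the outcome predicted under
    the current strategy and the observed outcome (assumed form)."""
    return sum((p - a) ** 2 for p, a in zip(predicted_outcome, actual_outcome))

def combined_intrinsic_reward(ss, sus, w_ss, w_sus):
    """Linear combination with weighting coefficients; in the paper
    these are learned, here they are passed in as fixed values."""
    return w_ss * ss + w_sus * sus

# Hypothetical usage with toy 2-D strategy and outcome vectors
ss = strategy_stability([1.0, 0.0], [1.0, 0.0])       # identical strategies -> 1.0
sus = strategy_surprise([0.0, 0.0], [1.0, 1.0])       # squared error -> 2.0
reward = combined_intrinsic_reward(ss, sus, 0.5, 0.5)  # 0.5*1.0 + 0.5*2.0 -> 1.5
```

In practice the weighting coefficients would be trained jointly with the policy rather than fixed, so that the balance between stability and surprise adapts over the course of training.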