Rubric-based reward shaping provides interpretable and editable reward signals for fine-tuning LLMs via reinforcement learning (RL), but existing adaptive rubric methods typically update criteria from local evidence such as the current batch or instance-level comparisons. This local view discards diagnostic information produced during training, making it difficult to track recurring failures, evaluate previous rubric edits, or raise standards once earlier criteria become saturated. We introduce AMARIS (A Memory-Augmented Rubric Improvement System), which grounds rubric updates in longitudinal training evidence. AMARIS stores rollout analyses, step-level summaries, and rubric update records in a persistent evaluation memory, then retrieves recent and semantically relevant history to revise rubrics. We evaluate AMARIS across science, medicine, instruction following, and creative writing under both global and instance-specific rubric settings. AMARIS improves over static, local-adaptive, and memory-ablated baselines, with gains over the strongest baselines that include +2.8 points on GPQA-Diamond and +2.2 points on IFBench. Our analysis shows that memory reduces oscillatory rubric edits and supports a progression from early failure correction to later curriculum advancement. AMARIS runs asynchronously alongside the normal RL loop, reducing blocking latency relative to synchronous rubric updates.