Reinforcement learning has become a dominant technique for fine-tuning the behavior of large language models, with policy optimization (PO) algorithms such as GRPO, DAPO, and Dr. GRPO appearing in rapid succession to advance state-of-the-art reasoning and alignment performance. However, the modular differences between these algorithms, including targeted improvements to clipping, advantage estimation, and reward aggregation, are introduced across separate papers with inconsistent notation, which makes the algorithms difficult to compare and intimidating to non-experts. We present UNIPO, the first interactive visualization tool that exposes the token-level training dynamics of RL fine-tuning algorithms through a unified design. UNIPO connects three complementary views: a high-level training overview, a step-level prompt and response inspector, and a side-by-side algorithm comparison. Together, these views allow learners to observe how individual design decisions propagate through training. Through two usage scenarios, we demonstrate how UNIPO supports both classroom instruction for non-experts and algorithm selection for AI practitioners. Our tool is open-source and publicly available at https://poloclub.github.io/unipo.