Large Language Models (LLMs) often generate unnecessarily verbose Chain-of-Thought (CoT) reasoning, which increases computational cost and latency without proportional performance gains. In this paper, we propose \textbf{F}ine-grained \textbf{G}roup policy \textbf{O}ptimization (\textbf{FGO}), a Reinforcement Learning (RL) algorithm that refines group responses by subdividing them into segments and assigning each segment a weight based on its length and entropy, thereby enabling effective CoT compression. As an enhanced variant of Group Relative Policy Optimization (GRPO), FGO also addresses two major limitations of GRPO: inefficient data utilization and entropy collapse. We evaluate FGO on multiple reasoning LLMs and benchmarks, including MATH500, AIME24, AMC23, and Minerva. Experimental results show that FGO achieves efficient CoT compression without degrading performance while simultaneously resolving both limitations of GRPO.
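To make the subdivision-and-weighting idea concrete, the sketch below shows one plausible way to compute FGO-style per-segment advantages. Only the GRPO-style group-normalized baseline and the idea of reweighting subdivided responses by length and entropy come from the abstract; the exponential length penalty, the linear entropy bonus, the coefficients \texttt{alpha} and \texttt{beta}, and the function name \texttt{fgo\_segment\_advantages} are illustrative assumptions, not the paper's actual formulation.

\begin{verbatim}
import torch

def fgo_segment_advantages(rewards, seg_lengths, seg_entropies,
                           alpha=0.1, beta=0.1, eps=1e-8):
    """Hypothetical FGO-style sketch: per-segment advantages.

    rewards:       (G,) tensor, scalar reward per sampled response
    seg_lengths:   list of G tensors, (S_i,) token count of each segment
    seg_entropies: list of G tensors, (S_i,) mean token entropy per segment
    Returns a list of G tensors, one advantage per segment.
    """
    # GRPO baseline: group-relative, standardized rewards.
    base = (rewards - rewards.mean()) / (rewards.std() + eps)

    seg_advs = []
    for adv, lens, ents in zip(base, seg_lengths, seg_entropies):
        # Assumed weighting: shrink credit for disproportionately long
        # segments (encouraging CoT compression) and boost credit for
        # high-entropy segments (discouraging entropy collapse).
        length_w = torch.exp(-alpha * (lens / (lens.mean() + eps) - 1.0))
        entropy_w = 1.0 + beta * (ents - ents.mean())
        seg_advs.append(adv * length_w * entropy_w)
    return seg_advs

if __name__ == "__main__":
    G = 4  # group size, i.e. responses sampled per prompt
    rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
    seg_lengths = [torch.randint(20, 200, (3,)).float() for _ in range(G)]
    seg_entropies = [torch.rand(3) for _ in range(G)]
    for a in fgo_segment_advantages(rewards, seg_lengths, seg_entropies):
        print(a)
\end{verbatim}

Under these assumptions, a correct response's positive group-relative advantage is concentrated on its short, high-entropy segments, which is one way the abstract's twin goals (compression without performance loss, and avoiding entropy collapse) could both enter the policy gradient.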