Large language models (LLMs) trained for step-by-step reasoning often become excessively verbose, which raises inference cost. Standard Reinforcement Learning with Verifiable Rewards (RLVR) pipelines filter out ``easy'' problems for the sake of training efficiency, leaving the model to train primarily on harder problems that require longer reasoning chains. This skews the output length distribution upward and yields a \textbf{model that conflates ``thinking longer'' with ``thinking better''}. In this work, we show that retaining and modestly up-weighting moderately easy problems acts as an implicit length regularizer: exposing the model to solvable short-chain tasks constrains its output distribution and prevents runaway verbosity. The result is \textbf{\emph{emergent brevity for free}}: the model learns to solve harder problems without inflating its output length, \textbf{despite the absence of any explicit length penalty}. RLVR experiments with this approach on \textit{Qwen3-4B-Thinking-2507} (with a 16k-token limit) match baseline pass@1 accuracy on AIME25 while generating solutions that are, on average, nearly half as long. The code is available at \href{https://github.com/MBZUAI-Paris/Frugal-AI}{GitHub}, with datasets and models on \href{https://huggingface.co/collections/MBZUAI-Paris/k2-think-mini-68dcfa8b114686a4bd3dc2bc}{Hugging Face}.
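To make the curation rule concrete, the following is a minimal sketch under stated assumptions: the symbols $p(x)$ (empirical pass rate of problem $x$), the cutoffs $p_{\min}$, $p_{\mathrm{easy}}$, $p_{\max}$, and the up-weight $\lambda$ are illustrative placeholders, not the paper's exact formulation. Training problems are sampled from
\begin{equation*}
q(x) \;\propto\; w\bigl(p(x)\bigr)\,\mathbf{1}\bigl[\,p_{\min} < p(x) < p_{\max}\,\bigr],
\qquad
w(p) \;=\;
\begin{cases}
\lambda > 1, & p \ge p_{\mathrm{easy}} \quad \text{(moderately easy)},\\[2pt]
1, & \text{otherwise.}
\end{cases}
\end{equation*}
Unlike standard filtering, which effectively sets $q(x)=0$ for every easy item, this keeps moderately easy problems in the pool with a modest up-weight $\lambda$, so the short reasoning chains they admit keep the output length distribution anchored without any explicit length penalty in the reward.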