Reinforcement Learning from Verifiable Rewards (RLVR) is bottlenecked by data: existing synthesis pipelines rely on expert-written code or fixed templates, which confines growth to instance-level perturbations. We shift the evolvable unit from problem instances to task-family specifications. SSLogic is an agentic meta-synthesis framework in which LLM agents iteratively author and refine executable Generator-Validator pairs inside a closed Generate-Validate-Refine loop, producing families with new rules and difficulty gradients rather than parameter variations of old ones. A Multi-Gate Validation Protocol, which combines multi-strategy consensus with Adversarial Blind Review (independent agents solve each instance by writing and executing code), filters ill-posed tasks before they enter training. Starting from 400 seed families, two evolution rounds yield 953 families and 21,389 verifiable instances. Three converging comparisons (step-matched, token-matched, and size-controlled on external Enigmata data) consistently show that evolved data has higher training utility, with gains in the Enigmata setting of +5.2 on SynLogic, +3.0 on AIME25, and +5.5 on BBH. Fine-grained KORBench evaluation reveals selective improvements in logic (+13.2%) and operation (+9.6%), linking structural evolution to downstream gains. Code: https://github.com/AdAstraAbyssoque/Scaling-the-Scaling-Logic
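To make the Generate-Validate-Refine loop and the two validation gates concrete, the following is a minimal toy sketch, not the SSLogic implementation. Everything in it is an assumption made for illustration: the task family (solve a·x ≡ b mod m), the names `naive_generator`, `refined_generator`, `passes_gates`, `refine_stub`, and `evolve`, and the 0.95 pass-rate threshold. In particular, `refine_stub` is a hard-coded stand-in for the LLM agent that would rewrite the generator, and the "independent agents" are replaced by two brute-force solvers.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Instance:
    params: Dict[str, int]  # problem parameters a, b, m
    answer: int             # reference answer recorded by the generator


Generator = Callable[[random.Random], Instance]


def naive_generator(rng: random.Random) -> Instance:
    """Toy family: find x in [0, m) with (a * x) % m == b.
    Ill-posed whenever gcd(a, m) > 1, because x is then not unique."""
    m = rng.randint(4, 12)
    a = rng.randint(1, m - 1)
    x = rng.randint(0, m - 1)
    return Instance({"a": a, "b": (a * x) % m, "m": m}, x)


def refined_generator(rng: random.Random) -> Instance:
    """Same family, restricted to gcd(a, m) == 1 so the answer is unique."""
    m = rng.randint(4, 12)
    a = rng.randint(1, m - 1)
    while math.gcd(a, m) != 1:  # terminates: a == 1 is always coprime to m
        a = rng.randint(1, m - 1)
    x = rng.randint(0, m - 1)
    return Instance({"a": a, "b": (a * x) % m, "m": m}, x)


def validate(inst: Instance, candidate: int) -> bool:
    """Validator half of the pair: accept only the reference answer."""
    return 0 <= candidate < inst.params["m"] and candidate == inst.answer


def solve(params: Dict[str, int], descending: bool) -> int:
    """Brute-force solver; the search order is the 'strategy'."""
    a, b, m = params["a"], params["b"], params["m"]
    order = range(m - 1, -1, -1) if descending else range(m)
    for x in order:
        if (a * x) % m == b:
            return x
    raise ValueError("no solution")


def passes_gates(inst: Instance) -> bool:
    # Gate 1 (consensus): two strategies must agree on the answer.
    first = solve(inst.params, descending=False)
    second = solve(inst.params, descending=True)
    if first != second:
        return False
    # Gate 2 (blind review): an answer computed without seeing the
    # reference must be accepted by the validator.
    return validate(inst, first)


def refine_stub(_current: Generator) -> Generator:
    """Hypothetical stand-in for an LLM agent rewriting the generator."""
    return refined_generator


def evolve(generator: Generator,
           refine: Callable[[Generator], Generator],
           rng: random.Random,
           batch: int = 200,
           threshold: float = 0.95,
           max_rounds: int = 3) -> Tuple[Generator, List[Instance], float]:
    """Generate a batch, filter it through the gates, refine if too few pass."""
    kept: List[Instance] = []
    rate = 0.0
    for _ in range(max_rounds):
        instances = [generator(rng) for _ in range(batch)]
        kept = [inst for inst in instances if passes_gates(inst)]
        rate = len(kept) / batch
        if rate >= threshold:
            break
        generator = refine(generator)
    return generator, kept, rate


if __name__ == "__main__":
    final_gen, kept, rate = evolve(naive_generator, refine_stub, random.Random(0))
    print(final_gen.__name__, len(kept), rate)
```

In this toy, the consensus gate rejects every instance with more than one valid x, so the naive generator falls below the threshold and is replaced. The real system's gates and refinement are agent-driven and far richer than this fixed rule.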