While Large Language Models (LLMs) have demonstrated significant potential in natural language processing, complex general-purpose reasoning requiring multi-step logic, planning, and verification remains a critical bottleneck. Although Reinforcement Learning with Verifiable Rewards (RLVR) has succeeded in specific domains, the field lacks large-scale, high-quality, and difficulty-calibrated data for general reasoning. To address this, we propose UltraLogic, a framework that decouples the logical core of a problem from its natural language expression through a Code-based Solving methodology, automating the production of high-quality data. The framework comprises hundreds of unique task types and an automated calibration pipeline spanning ten difficulty levels. Furthermore, to mitigate binary reward sparsity and the Non-negative Reward Trap, we introduce the Bipolar Float Reward (BFR) mechanism, which uses graded penalties to distinguish perfect responses from those with logical flaws. Our experiments demonstrate that task diversity is the primary driver of reasoning enhancement, and that BFR, combined with a difficulty-matching strategy, significantly improves training efficiency, guiding models toward global logical optima.
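To make the contrast with a binary verifiable reward concrete, the sketch below illustrates one plausible reading of a bipolar, graded-penalty reward; the function name, the violation-count interface, and the linear penalty scale are our own illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of a bipolar float reward, assuming the verifier reports how
# many logical constraints a response violates. All names and scales here are
# hypothetical illustrations of the idea, not UltraLogic's exact formulation.

def bipolar_float_reward(num_violations: int, total_constraints: int) -> float:
    """Map a verifier's violation count to a reward in [-1.0, 1.0].

    A flawless response earns the full positive reward; any logical flaw
    yields a negative value whose magnitude grows with the number of violated
    constraints, unlike a binary 0/1 reward where every imperfect response
    collapses to the same score.
    """
    if total_constraints <= 0:
        raise ValueError("total_constraints must be positive")
    if num_violations == 0:
        return 1.0  # perfect response
    # Graded penalty: more violations -> stronger negative signal.
    violation_ratio = min(num_violations / total_constraints, 1.0)
    return -violation_ratio


# A binary reward would score both flawed responses below as 0;
# the bipolar reward separates a near-miss from a badly flawed answer.
print(bipolar_float_reward(1, 10))   # -0.1 (minor flaw)
print(bipolar_float_reward(8, 10))   # -0.8 (severely flawed)
print(bipolar_float_reward(0, 10))   #  1.0 (perfect)
```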