Rule-based reasoning is widely acknowledged as a fundamental reasoning problem. While recent studies show that large reasoning models (LRMs) have remarkable reasoning capabilities enhanced by reinforcement learning (RL), real-world applications still face severe challenges due to variations in rule formats, types, and complexity. To mitigate this issue, we introduce RuleReasoner, an effective method for rule-based reasoning built on a broad collection of curated tasks and a novel domain-aware dynamic sampling approach in RL. Specifically, RuleReasoner resamples each training batch by updating domain weights based on historical rewards. This facilitates domain balance and an active learning schedule for RL, obviating static mix-training curricula engineered by humans. Evaluations on in-distribution (ID) and out-of-distribution (OOD) benchmarks show that RuleReasoner outperforms frontier LRMs by a significant margin ($\Delta$4.1% on eight ID tasks and $\Delta$10.4% on three OOD tasks over OpenAI-o1). Notably, our approach also achieves higher computational efficiency than prior methods.
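The domain-aware dynamic sampling described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the class name `DomainAwareSampler`, the sliding reward window, and the softmax-over-difficulty weighting are all assumptions chosen to match the abstract's description (domains with lower historical reward are resampled more often).

```python
import math
import random
from collections import deque

class DomainAwareSampler:
    """Hypothetical sketch of domain-aware dynamic sampling: per-domain
    weights are recomputed from a sliding window of historical rewards,
    so lower-reward (harder) domains receive more samples per batch."""

    def __init__(self, domains, window=50, temperature=1.0):
        self.domains = list(domains)
        # Sliding window of recent rewards per domain (window size is illustrative).
        self.history = {d: deque(maxlen=window) for d in self.domains}
        self.temperature = temperature

    def update(self, domain, reward):
        """Record the reward observed for a rollout from this domain."""
        self.history[domain].append(reward)

    def weights(self):
        """Softmax over difficulty (1 - mean reward): harder domains weigh more."""
        # Unseen domains default to a neutral 0.5 mean reward.
        avg = {d: (sum(h) / len(h) if h else 0.5) for d, h in self.history.items()}
        logits = {d: (1.0 - a) / self.temperature for d, a in avg.items()}
        z = sum(math.exp(v) for v in logits.values())
        return {d: math.exp(v) / z for d, v in logits.items()}

    def sample_batch(self, pools, batch_size):
        """Resample a training batch according to the current domain weights."""
        w = self.weights()
        chosen = random.choices(
            self.domains, weights=[w[d] for d in self.domains], k=batch_size
        )
        return [random.choice(pools[d]) for d in chosen]
```

Because the weights are recomputed from rewards collected during training, the mixture shifts automatically toward under-performing domains, which is what removes the need for a hand-tuned static mix of domains.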