SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advances, LLMs still struggle with general reasoning tasks that require capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high-quality, verifiable training data spanning diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction-tuning datasets containing expert-annotated ground truth encode rich reasoning patterns that can be systematically adapted for RLVR. To test this, we conduct 100+ controlled RL experiments analyzing how data design choices affect downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non-trivial and has a significant impact on downstream reasoning performance. Moreover, selecting source tasks by their performance on each individual target task outperforms selecting them by overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, ZebraLogic, and MMLU-Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. The code and data are available at https://github.com/asuvarna31/supernova.
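
To make the contrast between the two selection strategies concrete, the following is a minimal sketch, not the SUPERNOVA implementation. It assumes a table of downstream scores obtained by training on one source task at a time; the task names, target names, and all numbers are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical downstream scores (in percent) on each target benchmark after
# RL training on a single source task. Names and numbers are illustrative only.
scores: Dict[str, Dict[str, int]] = {
    "source_a": {"target_x": 30, "target_y": 10},
    "source_b": {"target_x": 22, "target_y": 22},
    "source_c": {"target_x": 5, "target_y": 35},
}


def top_k_by_average(table: Dict[str, Dict[str, int]], k: int) -> List[str]:
    """Rank source tasks by their mean score across all targets."""
    avg = {src: sum(per.values()) / len(per) for src, per in table.items()}
    # Sort by descending average, breaking ties by name for determinism.
    return sorted(avg, key=lambda src: (-avg[src], src))[:k]


def top_k_per_target(table: Dict[str, Dict[str, int]], k: int) -> Dict[str, List[str]]:
    """Rank source tasks separately for each target benchmark."""
    targets = sorted({t for per in table.values() for t in per})
    return {
        t: sorted(table, key=lambda src: (-table[src][t], src))[:k]
        for t in targets
    }


if __name__ == "__main__":
    print(top_k_by_average(scores, 1))
    print(top_k_per_target(scores, 1))
```

In this toy table, the average-based ranking picks the source task that is moderately good everywhere, while the per-target ranking picks a different specialist for each target. This is only one way such a disagreement can arise; the abstract reports the empirical outcome without specifying the selection procedure at this level of detail.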
