Reinforcement Learning (RL) has been shown to significantly boost the reasoning capabilities of large language models (LLMs) on math, coding, and multi-hop reasoning tasks. However, RL fine-tuning requires abundant high-quality verifiable data, often sourced from human annotations, generated by frontier LLMs, or scored by LLM-based verifiers. All three have considerable limitations: human-annotated datasets are small and expensive to curate, LLM-generated data is hallucination-prone and costly, and LLM-based verifiers are inaccurate and slow. In this work, we investigate a cheaper alternative: RL fine-tuning on rule-generated synthetic data for multi-hop reasoning tasks. We discover that LLMs fine-tuned on such synthetic data perform significantly better on popular real-world question-answering benchmarks, despite the synthetic data containing only fictional knowledge. Stratifying performance by question difficulty, we find that synthetic data teaches LLMs to compose knowledge, a fundamental and generalizable reasoning skill. Our work highlights rule-generated synthetic reasoning data as a free and scalable resource to improve LLM reasoning capabilities.
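To make the idea concrete, below is a minimal sketch of what a rule-based generator for verifiable multi-hop QA data might look like. All entity names, relations, and templates here are invented for illustration and do not reflect the paper's actual generator; the key property shown is that answers are checkable by exact string match, so no LLM-based verifier is needed.

```python
# Hypothetical sketch of rule-generated synthetic multi-hop QA data.
# Entity names, relations, and templates are invented for illustration.
import random

FIRST = ["Zorblat", "Quindra", "Melvok", "Tarsine", "Orvell"]
LAST = ["Frenlow", "Dask", "Yurim", "Plovek", "Santrell"]
RELATIONS = ["mother", "employer", "mentor"]

def make_example(rng: random.Random, hops: int = 2) -> dict:
    """Build one multi-hop QA pair from a chain of fictional facts.

    Each hop states a fact like "The mentor of X is Y"; the question asks
    for the entity at the end of the chain, so the answer is verifiable by
    exact string match.
    """
    pool = [f"{f} {l}" for f in FIRST for l in LAST]
    chain = rng.sample(pool, hops + 1)  # distinct entities avoid ambiguity
    rels = [rng.choice(RELATIONS) for _ in range(hops)]
    facts = [
        f"The {rel} of {src} is {dst}."
        for rel, src, dst in zip(rels, chain, chain[1:])
    ]
    # Compose relations into a nested question, e.g.
    # "Who is the employer of the mother of Zorblat Dask?"
    q = chain[0]
    for rel in rels:
        q = f"the {rel} of {q}"
    return {
        "context": " ".join(facts),
        "question": f"Who is {q}?",
        "answer": chain[-1],  # exact-match reward target for RL
    }

if __name__ == "__main__":
    ex = make_example(random.Random(0))
    print(ex["context"])
    print(ex["question"], "->", ex["answer"])
```

Because every fact in such data is fictional, the model cannot answer from memorized knowledge and must compose the stated facts, which is the skill the abstract attributes to the observed gains.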