Large language models (LLMs) perform strongly on many language tasks but still struggle with complex multi-step reasoning across disciplines. Existing reasoning datasets often lack disciplinary breadth, reasoning depth, and diversity, as well as guiding principles for question synthesis. We propose DESIGNER: a DESIGN-logic-guidEd Reasoning data synthesis pipeline that leverages naturally available, large-scale raw documents to generate multidisciplinary questions. The central insight is the notion of Design Logic, a form of reusable meta-knowledge that encapsulates the structured process by which human experts transform knowledge into complex exam questions. It enables LLMs to generate new questions with the same complex reasoning patterns from entirely different source texts, with explicit control over difficulty, diversity, and question types. We use LLMs to reverse-engineer and abstract over 120,000 Design Logics from existing questions across diverse disciplines. Using a two-stage retrieve-and-generate mechanism that matches these Design Logics with raw documents, we synthesize two large-scale reasoning datasets spanning 75 disciplines: DLR-Book (3.04 million questions from a book corpus) and DLR-Web (1.66 million questions from a web corpus). Data analysis indicates that the questions synthesized by our method exhibit greater difficulty and diversity than those in baseline datasets. Supervised fine-tuning (SFT) of Qwen3 and Llama3 on our data substantially improves multidisciplinary reasoning and outperforms training on baseline datasets. Notably, applying SFT to the base versions of these models with only our data even surpasses their official final models, which underwent the full post-training pipeline.
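The two-stage retrieve-and-generate mechanism can be illustrated with a minimal sketch. This is not the paper's implementation: it substitutes a simple bag-of-words cosine similarity for whatever retriever the pipeline actually uses, the example Design Logics and document are invented, and `build_generation_prompt` is a hypothetical stand-in for the real question-synthesis prompt.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Bag-of-words term-frequency vector (toy stand-in for a learned embedder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_logics(document, design_logics, k=2):
    """Stage 1: rank abstracted Design Logics by relevance to the source document."""
    doc_vec = bow_vector(document)
    ranked = sorted(design_logics,
                    key=lambda logic: cosine(doc_vec, bow_vector(logic)),
                    reverse=True)
    return ranked[:k]


def build_generation_prompt(document, logic):
    """Stage 2: instruct a question-synthesis LLM to apply a Design Logic to new text."""
    return ("Apply the following design logic to the source text and write one "
            f"multi-step exam question.\nDesign logic: {logic}\nSource text: {document}")


# Invented examples of abstracted Design Logics and a raw source document.
design_logics = [
    "compare two competing theories and ask which better explains an observation",
    "give numerical data and require a two-step unit conversion before computing a rate",
    "present a historical event and ask for its counterfactual economic consequences",
]
document = ("The observation that galaxies rotate faster than visible mass predicts "
            "led to two competing theories, dark matter and modified gravity.")

top = retrieve_logics(document, design_logics, k=1)
prompt = build_generation_prompt(document, top[0])
```

In this toy run the retriever matches the document about competing cosmological theories to the "compare two competing theories" logic, and the resulting prompt pairs that logic with the new source text, mirroring how one reusable Design Logic can be re-applied to entirely different documents.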