In this report, we introduce DASD-4B-Thinking, a lightweight yet highly capable, fully open-source reasoning model. It achieves state-of-the-art performance among open-source models of comparable scale across challenging benchmarks in mathematics, scientific reasoning, and code generation -- even outperforming several larger models. We begin by critically reexamining a widely adopted distillation paradigm in the community: SFT on teacher-generated responses, also known as sequence-level distillation. Although a series of recent works following this scheme have demonstrated remarkable efficiency and strong empirical performance, they are grounded primarily in the SFT perspective. Consequently, these approaches focus predominantly on designing heuristic rules for SFT data filtering, while largely overlooking the core principle of distillation itself -- enabling the student model to learn the teacher's full output distribution so as to inherit its generalization capability. Specifically, we identify three critical limitations in current practice: i) inadequate representation of the teacher's sequence-level distribution; ii) misalignment between the teacher's output distribution and the student's learning capacity; and iii) exposure bias arising from teacher-forced training versus autoregressive inference. Taken together, these shortcomings reflect a systemic absence of explicit teacher-student interaction throughout the distillation process, leaving the essence of distillation underexploited. To address these issues, we propose several methodological innovations that collectively form an enhanced sequence-level distillation training pipeline. Remarkably, DASD-4B-Thinking obtains competitive results using only 448K training samples -- an order of magnitude fewer than those employed by most existing open-source efforts. To support community research, we publicly release our models and the training dataset.