Recent work on large language models (LLMs) has increasingly focused on post-training and alignment, using datasets curated to enhance instruction following, world knowledge, and specialized skills. However, most post-training datasets used in leading open- and closed-source LLMs remain inaccessible to the public, with limited information about their construction process. This lack of transparency has motivated the recent development of open-source post-training corpora. While training on these open alternatives can yield performance comparable to that of leading models, systematic comparisons remain challenging due to the significant computational cost of conducting them rigorously at scale, and are therefore largely absent. As a result, when assessing data quality, it remains unclear how specific samples, task types, or curation strategies influence downstream performance. In this work, we conduct the first comprehensive side-by-side analysis of two prominent open post-training datasets: Tulu-3-SFT-Mix and SmolTalk. Using the Magpie framework, we annotate each sample with detailed quality metrics, including turn structure (single-turn vs. multi-turn), task category, input quality, and response quality, and we derive statistics that reveal structural and qualitative similarities and differences between the two datasets. Based on these insights, we design a principled curation recipe that produces a new data mixture, TuluTalk, which contains 14% fewer samples than either source dataset while matching or exceeding their performance on key benchmarks. Our findings offer actionable insights for constructing more effective post-training datasets that improve model performance within practical resource limits. To support future research, we publicly release both the annotated source datasets and our curated TuluTalk mixture.