Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective selection requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well, but provide only implicit quality signals. Semantic selection methods offer stronger judgments, but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.
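The final filtering step described above, scoring each document with a scorer specific to its source group and keeping the best-scoring portion, can be sketched roughly as follows. This is an illustrative sketch only, not the MIRA implementation: the `Doc` type, the `filter_by_group` function, the toy heuristic scorers, and the choice to apply the token budget separately within each source group are all assumptions introduced here. In the framework described, the scorers would be distilled student models trained on rubric-based judgments rather than string checks.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Doc:
    """Hypothetical document record; field names are illustrative."""
    source_group: str
    text: str
    n_tokens: int


# A scorer maps document text to a quality score (higher is better).
Scorer = Callable[[str], float]


def filter_by_group(
    docs: List[Doc],
    scorers: Dict[str, Scorer],
    keep_fraction: float = 0.5,
) -> List[Doc]:
    """Keep the top-scoring documents of each source group.

    Assumption: the token budget is applied per group, so each group
    retains at most `keep_fraction` of its own tokens.
    """
    by_group: Dict[str, List[Doc]] = defaultdict(list)
    for doc in docs:
        by_group[doc.source_group].append(doc)

    kept: List[Doc] = []
    for group, group_docs in by_group.items():
        scorer = scorers[group]  # each group uses its own criteria
        budget = keep_fraction * sum(d.n_tokens for d in group_docs)
        ranked = sorted(group_docs, key=lambda d: scorer(d.text), reverse=True)
        used = 0
        for doc in ranked:
            if used + doc.n_tokens > budget:
                break  # stop once the next document would exceed the budget
            kept.append(doc)
            used += doc.n_tokens
    return kept


# Toy stand-ins for distilled student scorers (assumptions, not real rubrics).
TOY_SCORERS: Dict[str, Scorer] = {
    "code": lambda text: float("def " in text),
    "qa": lambda text: float("?" in text),
}
```

Ranking within each group, rather than over the pooled corpus, keeps scores from different scorers from being compared directly, which matters when groups are judged on different criteria. Whether the budget should be split per group or set globally is a design choice this sketch does not settle.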