Synthetic data has become essential for training code generation models, yet it introduces significant noise and hallucinations that are difficult to detect with current metrics. Existing data selection methods such as Instruction-Following Difficulty (IFD) typically assess how difficult it is for a model to generate an answer given a query ($A|Q$). However, this metric is ambiguous on noisy synthetic data, where a low probability may reflect either intrinsic task complexity or model-generated hallucinations. Here, we propose QAQ, a novel data selection framework that evaluates data quality from the reverse direction: how well can the answer predict the query ($Q|A$)? We define Reverse Mutual Information (RMI) to quantify the information gain about the query when conditioned on the answer. Our analyses reveal that both extremes of RMI signal quality issues: low RMI indicates semantic misalignment, while excessively high RMI may indicate defect patterns that LLMs recognize easily. Furthermore, we introduce a selection strategy based on the disagreement between strong and weak models to identify samples that are valid yet challenging. Experiments on the WarriorCoder dataset demonstrate that selecting just 25% of the data via stratified RMI achieves performance comparable to full-data training, significantly outperforming existing data selection methods. Our approach highlights the importance of bidirectional semantic coherence in synthetic data curation, offering a scalable pathway to reduce computational costs without sacrificing model capability.
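The RMI-based stratified selection described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes RMI is estimated as $\log P(Q|A) - \log P(Q)$ from per-token query log-probabilities (scored with and without the answer in context), and the function names, quantile cutoffs, and spacing strategy are hypothetical choices for exposition.

```python
def rmi(logp_q_given_a, logp_q):
    """Estimate Reverse Mutual Information for one sample.

    Both arguments are lists of per-token log-probabilities of the
    query tokens: `logp_q_given_a` scored with the answer in context,
    `logp_q` scored without it. RMI = log P(Q|A) - log P(Q).
    """
    return sum(logp_q_given_a) - sum(logp_q)

def stratified_select(samples, rmis, keep_frac=0.25, low_q=0.1, high_q=0.9):
    """Keep a keep_frac subset drawn evenly from the middle RMI band.

    Both extremes are discarded: the lowest quantile (semantic
    misalignment) and the highest (easily recognized defect patterns).
    The quantile cutoffs here are illustrative, not the paper's values.
    """
    order = sorted(range(len(samples)), key=lambda i: rmis[i])
    lo, hi = int(len(order) * low_q), int(len(order) * high_q)
    band = order[lo:hi]                      # middle band of RMI scores
    k = max(1, int(len(samples) * keep_frac))
    step = max(1, len(band) // k)            # evenly spaced across the band
    chosen = band[::step][:k]
    return [samples[i] for i in chosen]
```

In practice the two log-probability passes would come from a scoring LLM; the strong/weak-model disagreement signal mentioned above could then be layered on top as an additional filter within the retained band.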