SWE-bench has emerged as the premier benchmark for evaluating Large Language Models on complex software engineering tasks. While these capabilities are fundamentally acquired during the mid-training phase and subsequently elicited during Supervised Fine-Tuning (SFT), there remains a critical deficit in metrics capable of guiding mid-training effectively. Standard metrics such as Perplexity (PPL) are compromised by the "Long-Context Tax" and correlate weakly with downstream SWE performance. In this paper, we bridge this gap by first introducing a rigorous data filtering strategy. Crucially, we propose the Entropy Compression Hypothesis, which redefines intelligence not as scalar Top-1 compression but as the capacity to structure uncertainty into low-order Entropy-Compressed States ("reasonable hesitation"). Grounded in this fine-grained entropy analysis, we formulate a novel metric, HE-SNR (High-Entropy Signal-to-Noise Ratio). Validated on industrial-scale Mixture-of-Experts (MoE) models across varying context windows (32K/128K), HE-SNR demonstrates superior robustness and predictive power. This work provides both the theoretical foundation and practical tools for optimizing the latent potential of LLMs in complex engineering domains.
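The fine-grained entropy analysis above operates on next-token distributions. As a minimal illustration only (this is standard per-token Shannon entropy computed from model logits, not the paper's HE-SNR definition, whose exact formulation is given in the body of the work), one can sketch the quantity being analyzed as:

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution
    at each sequence position.

    logits: array of shape (seq_len, vocab_size)
    """
    # Numerically stable softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    # H = -sum_i p_i log p_i, guarding against log(0)
    return -(p * np.log(np.clip(p, 1e-12, None))).sum(axis=-1)

# A uniform distribution over V tokens has entropy log(V) (maximal
# uncertainty); a near-one-hot distribution has entropy near 0
# (confident Top-1 prediction).
uniform = np.zeros((1, 8))             # all logits equal -> uniform
peaked = np.array([[20.0] + [0.0] * 7])  # one dominant logit
print(token_entropy(uniform))  # ~log(8) ~ 2.079
print(token_entropy(peaked))   # ~0
```

Under the hypothesis described above, a high-entropy position is not necessarily a failure of compression: what matters is whether the mass concentrates on a small set of plausible continuations (low-order structure) rather than spreading diffusely over the vocabulary.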