Understanding the relationship between data compression and the capabilities of Large Language Models (LLMs) is crucial, especially in specialized domains such as code intelligence. Prior work posited a linear relationship between compression and general intelligence, but it overlooked the multifaceted nature of code, which spans diverse programming languages and tasks, and it struggled to evaluate modern Code LLMs fairly. We address these gaps by evaluating a diverse array of open-source Code LLMs on comprehensive multi-language, multi-task code benchmarks. To evaluate the code intelligence of pre-trained LLMs efficiently and fairly, we introduce \textit{Format Annealing}, a lightweight, transparent training methodology designed to assess the intrinsic capabilities of these models equitably. Compression efficacy, measured in bits per character (BPC), is computed on a novel, large-scale, previously unseen code validation set derived from GitHub. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and BPC. This finding refines prior hypotheses of linearity, which we suggest likely reflect the tail of the logarithmic curve observed under specific, limited conditions. Our work provides a more nuanced understanding of compression's role in developing code intelligence and contributes a robust evaluation framework for the code domain.
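As a rough illustration of the two quantities the abstract relates, the sketch below converts a summed negative log-likelihood into BPC and fits a logarithmic curve by ordinary least squares. The functional form `score = a + b * ln(BPC)`, the function names, and the synthetic data points are all assumptions made for illustration; the abstract does not specify the exact parameterization or fitting procedure used.

```python
import math


def bits_per_character(total_nll_nats: float, num_chars: int) -> float:
    """Convert a summed negative log-likelihood (in nats) into bits per character."""
    if num_chars <= 0:
        raise ValueError("num_chars must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by the character
    # count (not the token count) makes the value tokenizer-independent.
    return total_nll_nats / (math.log(2) * num_chars)


def fit_log_curve(bpcs, scores):
    """Least-squares fit of the assumed form score = a + b * ln(bpc); returns (a, b)."""
    xs = [math.log(b) for b in bpcs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic (hypothetical) points generated from score = 10 - 5 * ln(bpc),
# used only to show that the fit recovers the coefficients.
example_bpcs = [0.5, 1.0, 2.0]
example_scores = [10 - 5 * math.log(b) for b in example_bpcs]
a, b = fit_log_curve(example_bpcs, example_scores)
```

Normalizing by characters rather than tokens is what allows models with different tokenizers to be compared on the same validation text.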