Recent years have witnessed rapid advancement in large language models (LLMs) and their expanding applications, leading to soaring demands for computational resources. The widespread adoption of test-time scaling further intensifies the tension between model capability and resource consumption, highlighting the importance of inference efficiency. However, a unified metric that accurately reflects an LLM's efficiency across diverse model sizes and architectures remains absent. Motivated by the correlation between compression and intelligence, we introduce information capacity, a measure of model efficiency based on text compression performance relative to computational complexity. A distinctive feature of information capacity is its incorporation of tokenizer efficiency, which affects inference costs but is often neglected in LLM evaluations. We assess the information capacity of 52 open-source models and observe that models of different sizes within a series exhibit consistent information capacity. Experiments on five heterogeneous datasets reveal strong linguistic bias in mainstream LLMs. Three major factors shaping information capacity are tokenizer efficiency, pretraining data, and the mixture-of-experts architecture. Empirical results verify that information capacity enables accurate performance prediction across model sizes and correlates with benchmark scores.
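The abstract characterizes information capacity only at a high level: compression performance relative to computational complexity, with tokenizer efficiency folded in. The following is a minimal sketch of how such a measurement could be set up, not the paper's actual formula: the `information_capacity` helper, the choice of ratios, and the 2-FLOPs-per-parameter cost estimate are all illustrative assumptions.

```python
# Illustrative sketch of a compression-vs-compute efficiency measurement.
# ASSUMPTIONS: the paper's exact definition of information capacity is not
# given in the abstract; the ratios and the ~2 FLOPs/parameter/token cost
# model below are placeholders, not the published metric.

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def information_capacity(model_name: str, text: str) -> dict:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)

    # out.loss is mean cross-entropy in nats/token over the predicted
    # positions; scale to total bits (the model's "compressed size" of text).
    n_predicted = ids.numel() - 1
    total_bits = out.loss.item() * n_predicted / math.log(2)

    # Tokenizer efficiency: raw characters represented per token.
    chars_per_token = len(text) / ids.numel()

    # Rough forward-pass cost: ~2 FLOPs per parameter per token (assumed).
    n_params = sum(p.numel() for p in model.parameters())
    flops = 2 * n_params * ids.numel()

    return {
        "bits_per_char": total_bits / len(text),
        "chars_per_token": chars_per_token,
        "bits_per_gflop": total_bits / (flops / 1e9),  # placeholder ratio
    }
```

Under these assumptions, a model that compresses more bits of raw text per unit of compute, or that packs more characters into each token, scores as more efficient, which matches the abstract's framing of tokenizer efficiency as a real but often-neglected component of inference cost.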