Existing methods for evaluating large language models face challenges such as data contamination, sensitivity to prompts, and the high cost of benchmark creation. To address these challenges, we propose an evaluation approach based on lossless data compression that tests how models' predictive abilities generalize after their training cutoff. Specifically, we collect comprehensive test data spanning 83 months from 2017 to 2023 and split the data into training and testing periods according to each model's training data cutoff. We measure: 1) the compression performance on the testing period, as a measure of generalization to unseen data; and 2) the performance gap between the training and testing periods, as a measure of robustness. Our experiments test 14 representative large language models of various sizes on sources including Wikipedia, news articles, code, arXiv papers, and multi-modal data. We find that the compression rate of many models declines significantly after their cutoff date, but models such as Mistral and Llama-2 demonstrate a good balance between performance and robustness. Results also suggest that models struggle to generalize on news and code data, but work especially well on arXiv papers. We also find that context size and tokenization implementation have a large impact on overall compression performance.
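The two measurements described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an idealized arithmetic coder that spends -log2 p(token) bits per token, it uses bits per byte as the compression metric (lower is better), and the names `bits_per_byte` and `generalization_and_gap`, the month format, and the sample numbers are all hypothetical.

```python
import math
from statistics import mean


def bits_per_byte(token_logprobs, num_bytes):
    """Ideal compressed size in bits divided by raw size in bytes.

    Assumption: an arithmetic coder driven by the model spends
    -log2 p(token) bits per token. token_logprobs are natural-log
    probabilities the model assigned to each token of the text.
    """
    compressed_bits = -sum(token_logprobs) / math.log(2)
    return compressed_bits / num_bytes


def generalization_and_gap(monthly_bpb, cutoff):
    """Split per-month scores at a model's training cutoff.

    monthly_bpb: dict mapping zero-padded 'YYYY-MM' -> bits per byte.
    cutoff: 'YYYY-MM', assumed to be the last month in the training period.
    Returns (test-period mean, test mean minus training mean).
    """
    train = [v for month, v in monthly_bpb.items() if month <= cutoff]
    test = [v for month, v in monthly_bpb.items() if month > cutoff]
    if not train or not test:
        raise ValueError("cutoff must leave data on both sides of the split")
    test_mean = mean(test)
    return test_mean, test_mean - mean(train)


# Hypothetical numbers, for illustration only.
example_bpb = bits_per_byte([math.log(0.5)] * 16, num_bytes=8)
scores = {"2022-11": 0.80, "2022-12": 0.82, "2023-01": 0.90, "2023-02": 0.92}
test_score, gap = generalization_and_gap(scores, cutoff="2022-12")
```

Under this convention, a lower test-period score indicates better generalization to unseen data, and a gap close to zero indicates a model whose compression ability holds up after its cutoff.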