LLMEval-3：关于大型语言模型稳健与公平评估的大规模纵向研究 (LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models) - 专知论文

会员服务 ·

0

稳健 · 污染 · 基准 · 一致 · 语言模型 ·

2025 年 12 月 3 日

LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models

翻译：LLMEval-3：关于大型语言模型稳健与公平评估的大规模纵向研究

Ming Zhang,Yujiong Shen,Jingyi Deng,Yuhui Wang,Yue Zhang,Junzhe Wang,Shichun Liu,Shihan Dou,Huayu Sha,Qiyuan Peng,Changhao Jiang,Jingqi Tong,Yilong Wu,Zhihao Zhang,Mingqi Wu,Zhiheng Xi,Mingxu Chai,Tao Gui,Qi Zhang,Xuanjing Huang

from arxiv, This work is withdrawn as all authors are not in agreement on the work

Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. An 20-month longitudinal study of nearly 50 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-3 offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.

翻译：现有大型语言模型（LLMs）在静态基准测试上的评估易受数据污染和排行榜过拟合的影响，这些关键问题掩盖了模型的真实能力。为解决此问题，我们提出了LLMEval-3，一个用于动态评估LLMs的框架。LLMEval-3基于一个包含22万道研究生水平试题的专有题库构建，每次评估运行时动态采样未见过的测试集。其自动化流程通过抗污染数据管理、新颖的反作弊架构，以及一个与人类专家达成90%一致性的校准化LLM-as-a-judge评判过程，确保了评估的完整性，并辅以相对排名系统以实现公平比较。一项为期20个月、涵盖近50个领先模型的纵向研究揭示了知识记忆的性能上限，并暴露了静态基准无法检测的数据污染漏洞。该框架在排名稳定性和一致性方面展现出卓越的稳健性，为动态评估范式提供了强有力的实证验证。LLMEval-3提供了一种超越排行榜分数的、稳健可信的评估方法，用于衡量LLMs的真实能力，有助于推动更可信赖的评估标准的发展。

0

相关内容

金融领域大型语言模型综述（FinLLMs）

金融领域大型语言模型综述（FinLLMs）

专知会员服务

71+阅读 · 2024年2月6日

【纽约大学 Ethan Perez 博士论文】在预训练语言模型中发现和修正不良行为，217页pdf，，Finding and Fixing Undesirable Behaviors in Pretrained Language Models

【纽约大学 Ethan Perez 博士论文】在预训练语言模型中发现和修正不良行为，217页pdf，，Finding and Fixing Undesirable Behaviors in Pretrained Language Models

专知会员服务

18+阅读 · 2022年3月16日

【MIT-ICLR2022】在机器学习模型中注入公平性, Injecting fairness into machine-learning models

【MIT-ICLR2022】在机器学习模型中注入公平性, Injecting fairness into machine-learning models

专知会员服务

22+阅读 · 2022年3月7日

【俄亥俄州立大学学生论文】鲁棒自然语言理解，74页pdf，Towards More Robust Natural Language Understanding

【俄亥俄州立大学学生论文】鲁棒自然语言理解，74页pdf，Towards More Robust Natural Language Understanding

专知会员服务

19+阅读 · 2022年3月1日

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

31+阅读 · 2021年9月29日

语言视觉预训练语言模型揭密，Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

语言视觉预训练语言模型揭密，Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

专知会员服务

36+阅读 · 2020年5月20日

语义相似性算法演化论文，29页pdf，Evolution of Semantic Similarity - A Survey

语义相似性算法演化论文，29页pdf，Evolution of Semantic Similarity - A Survey

专知会员服务

44+阅读 · 2020年4月30日

随机特征核近似综述: 算法与理论，Random Features for Kernel Approximation: A Survey in Algorithms, Theory, and Beyond

随机特征核近似综述: 算法与理论，Random Features for Kernel Approximation: A Survey in Algorithms, Theory, and Beyond

专知会员服务

33+阅读 · 2020年4月26日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

微软发布DialoGPT预训练语言模型，论文与代码 Large-Scale Generative Pre-training for Conversational Response Generation

微软发布DialoGPT预训练语言模型，论文与代码 Large-Scale Generative Pre-training for Conversational Response Generation

专知会员服务

28+阅读 · 2019年11月8日

ICLR'21 | GNN联邦学习的新基准

ICLR'21 | GNN联邦学习的新基准

图与推荐

12+阅读 · 2021年11月15日

最新最全《深度元学习》2021综述论文，68页pdf，A Survey of Deep Meta-Learning

最新最全《深度元学习》2021综述论文，68页pdf，A Survey of Deep Meta-Learning

专知

11+阅读 · 2021年4月23日

Python图像处理，366页pdf，Image Operators Image Processing in Python

Python图像处理，366页pdf，Image Operators Image Processing in Python

专知

15+阅读 · 2020年7月23日

图机器学习 2.2-2.4 Properties of Networks, Random Graph

图机器学习 2.2-2.4 Properties of Networks, Random Graph

图与推荐

10+阅读 · 2020年3月28日

【NeurIPS2019】图变换网络：Graph Transformer Network

【NeurIPS2019】图变换网络：Graph Transformer Network

专知

245+阅读 · 2019年11月18日

【AutoML】自动机器学习：最近进展研究综述 AutoML：A survey of State-of-the-art

【AutoML】自动机器学习：最近进展研究综述 AutoML：A survey of State-of-the-art

产业智能官

15+阅读 · 2019年8月13日

【语义分割】一文概览主要语义分割网络：FCN,SegNet,U-Net...

【语义分割】一文概览主要语义分割网络：FCN,SegNet,U-Net...

产业智能官

18+阅读 · 2018年7月26日

Facebook开源MUSE：多语言无监督和监督词向量库

Facebook开源MUSE：多语言无监督和监督词向量库

论智

20+阅读 · 2017年12月23日

LibRec 每周算法：LDA主题模型

LibRec 每周算法：LDA主题模型

LibRec智能推荐

29+阅读 · 2017年12月4日

自然语言处理（二）机器翻译篇 (NLP: machine translation)

自然语言处理（二）机器翻译篇 (NLP: machine translation)

DeepLearning中文论坛

12+阅读 · 2015年7月1日

城市“建成环境——空间行为”的多尺度影响关系与机理研究

国家自然科学基金

13+阅读 · 2017年12月31日

语义Web知识库补全关键技术研究

国家自然科学基金

18+阅读 · 2017年12月31日

基于图论方法的DNA序列编码研究

国家自然科学基金

2+阅读 · 2016年12月31日

“自然语言-草图”耦合的地理场景查询方法研究

国家自然科学基金

3+阅读 · 2015年12月31日

基于格值逻辑的语言真值α-群锁语义归结自动推理研究

国家自然科学基金

0+阅读 · 2015年12月31日

SDN数据平面中大规模流表的高性能查找方法研究

国家自然科学基金

4+阅读 · 2015年12月31日

P3P问题解分布的临界曲面研究

国家自然科学基金

1+阅读 · 2015年12月31日

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

46+阅读 · 2015年12月31日

动态Gr？bner 基与GVW算法

国家自然科学基金

0+阅读 · 2014年12月31日

Biot模型基于有限元离散的多重网格算法研究

国家自然科学基金

1+阅读 · 2014年12月31日

Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities

Arxiv

11+阅读 · 2024年1月16日

A Survey of Large Language Models

A Survey of Large Language Models

Arxiv

499+阅读 · 2023年3月31日

A Survey of Knowledge Enhanced Pre-trained Models

Arxiv

28+阅读 · 2021年10月1日

Z-GCNETs: Time Zigzags at Graph Convolutional Networks for Time Series Forecasting

Arxiv

10+阅读 · 2021年5月10日

Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks

Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks

Arxiv

13+阅读 · 2020年4月13日

Aleatoric and Epistemic Uncertainty in Machine Learning: An Introduction to Concepts and Methods

Aleatoric and Epistemic Uncertainty in Machine Learning: An Introduction to Concepts and Methods

Arxiv

15+阅读 · 2020年4月3日

Pre-trained Models for Natural Language Processing: A Survey

Arxiv

113+阅读 · 2020年3月18日

Graph Neural Networks: A Review of Methods and Applications

Graph Neural Networks: A Review of Methods and Applications

Arxiv

76+阅读 · 2018年12月20日

Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction

Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction

Arxiv

10+阅读 · 2018年8月29日

Learning beyond datasets: Knowledge Graph Augmented Neural Networks for Natural language Processing

Arxiv

11+阅读 · 2018年2月16日

VIP会员

文章信息

相关主题

相关VIP内容

金融领域大型语言模型综述（FinLLMs）

金融领域大型语言模型综述（FinLLMs）

专知会员服务

71+阅读 · 2024年2月6日

【纽约大学 Ethan Perez 博士论文】在预训练语言模型中发现和修正不良行为，217页pdf，，Finding and Fixing Undesirable Behaviors in Pretrained Language Models

【纽约大学 Ethan Perez 博士论文】在预训练语言模型中发现和修正不良行为，217页pdf，，Finding and Fixing Undesirable Behaviors in Pretrained Language Models

专知会员服务

18+阅读 · 2022年3月16日

【MIT-ICLR2022】在机器学习模型中注入公平性, Injecting fairness into machine-learning models

【MIT-ICLR2022】在机器学习模型中注入公平性, Injecting fairness into machine-learning models

专知会员服务

22+阅读 · 2022年3月7日

【俄亥俄州立大学学生论文】鲁棒自然语言理解，74页pdf，Towards More Robust Natural Language Understanding

【俄亥俄州立大学学生论文】鲁棒自然语言理解，74页pdf，Towards More Robust Natural Language Understanding

专知会员服务

19+阅读 · 2022年3月1日

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

31+阅读 · 2021年9月29日

语言视觉预训练语言模型揭密，Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

语言视觉预训练语言模型揭密，Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

专知会员服务

36+阅读 · 2020年5月20日

语义相似性算法演化论文，29页pdf，Evolution of Semantic Similarity - A Survey

语义相似性算法演化论文，29页pdf，Evolution of Semantic Similarity - A Survey

专知会员服务

44+阅读 · 2020年4月30日

随机特征核近似综述: 算法与理论，Random Features for Kernel Approximation: A Survey in Algorithms, Theory, and Beyond

随机特征核近似综述: 算法与理论，Random Features for Kernel Approximation: A Survey in Algorithms, Theory, and Beyond

专知会员服务

33+阅读 · 2020年4月26日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

微软发布DialoGPT预训练语言模型，论文与代码 Large-Scale Generative Pre-training for Conversational Response Generation

微软发布DialoGPT预训练语言模型，论文与代码 Large-Scale Generative Pre-training for Conversational Response Generation

专知会员服务

28+阅读 · 2019年11月8日

热门VIP内容

开通专知VIP会员享更多权益服务

《无人机与战争：被忽视的环境影响及无人机保护潜力》

俄罗斯规划未来无人机驱动军队

《整合杀伤链：一个用于边缘目标验证与战术推理的零样本框架》最新资料

《人工智能、武器与影响力：前沿模型在模拟核危机中展现复杂推理》2026最新46页报告

相关资讯

ICLR'21 | GNN联邦学习的新基准

ICLR'21 | GNN联邦学习的新基准

图与推荐

12+阅读 · 2021年11月15日

最新最全《深度元学习》2021综述论文，68页pdf，A Survey of Deep Meta-Learning

最新最全《深度元学习》2021综述论文，68页pdf，A Survey of Deep Meta-Learning

专知

11+阅读 · 2021年4月23日

Python图像处理，366页pdf，Image Operators Image Processing in Python

Python图像处理，366页pdf，Image Operators Image Processing in Python

专知

15+阅读 · 2020年7月23日

图机器学习 2.2-2.4 Properties of Networks, Random Graph

图机器学习 2.2-2.4 Properties of Networks, Random Graph

图与推荐

10+阅读 · 2020年3月28日

【NeurIPS2019】图变换网络：Graph Transformer Network

【NeurIPS2019】图变换网络：Graph Transformer Network

专知

245+阅读 · 2019年11月18日

【AutoML】自动机器学习：最近进展研究综述 AutoML：A survey of State-of-the-art

【AutoML】自动机器学习：最近进展研究综述 AutoML：A survey of State-of-the-art

产业智能官

15+阅读 · 2019年8月13日

【语义分割】一文概览主要语义分割网络：FCN,SegNet,U-Net...

【语义分割】一文概览主要语义分割网络：FCN,SegNet,U-Net...

产业智能官

18+阅读 · 2018年7月26日

Facebook开源MUSE：多语言无监督和监督词向量库

Facebook开源MUSE：多语言无监督和监督词向量库

论智

20+阅读 · 2017年12月23日

LibRec 每周算法：LDA主题模型

LibRec 每周算法：LDA主题模型

LibRec智能推荐

29+阅读 · 2017年12月4日

自然语言处理（二）机器翻译篇 (NLP: machine translation)

自然语言处理（二）机器翻译篇 (NLP: machine translation)

DeepLearning中文论坛

12+阅读 · 2015年7月1日

相关论文

Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities

Arxiv

11+阅读 · 2024年1月16日

A Survey of Large Language Models

A Survey of Large Language Models

Arxiv

499+阅读 · 2023年3月31日

A Survey of Knowledge Enhanced Pre-trained Models

Arxiv

28+阅读 · 2021年10月1日

Z-GCNETs: Time Zigzags at Graph Convolutional Networks for Time Series Forecasting

Arxiv

10+阅读 · 2021年5月10日

Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks

Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks

Arxiv

13+阅读 · 2020年4月13日

Aleatoric and Epistemic Uncertainty in Machine Learning: An Introduction to Concepts and Methods

Aleatoric and Epistemic Uncertainty in Machine Learning: An Introduction to Concepts and Methods

Arxiv

15+阅读 · 2020年4月3日

Pre-trained Models for Natural Language Processing: A Survey

Arxiv

113+阅读 · 2020年3月18日

Graph Neural Networks: A Review of Methods and Applications

Graph Neural Networks: A Review of Methods and Applications

Arxiv

76+阅读 · 2018年12月20日

Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction

Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction

Arxiv

10+阅读 · 2018年8月29日

Learning beyond datasets: Knowledge Graph Augmented Neural Networks for Natural language Processing

Arxiv

11+阅读 · 2018年2月16日

相关基金

城市“建成环境——空间行为”的多尺度影响关系与机理研究

国家自然科学基金

13+阅读 · 2017年12月31日

语义Web知识库补全关键技术研究

国家自然科学基金

18+阅读 · 2017年12月31日

基于图论方法的DNA序列编码研究

国家自然科学基金

2+阅读 · 2016年12月31日

“自然语言-草图”耦合的地理场景查询方法研究

国家自然科学基金

3+阅读 · 2015年12月31日

基于格值逻辑的语言真值α-群锁语义归结自动推理研究

国家自然科学基金

0+阅读 · 2015年12月31日

SDN数据平面中大规模流表的高性能查找方法研究

国家自然科学基金

4+阅读 · 2015年12月31日

P3P问题解分布的临界曲面研究

国家自然科学基金

1+阅读 · 2015年12月31日

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

46+阅读 · 2015年12月31日

动态Gr？bner 基与GVW算法

国家自然科学基金

0+阅读 · 2014年12月31日

Biot模型基于有限元离散的多重网格算法研究

国家自然科学基金

1+阅读 · 2014年12月31日

微信扫码咨询专知VIP会员