Current evaluation methods for large language models (LLMs) rely primarily on static benchmarks, which present two major challenges: limited knowledge coverage and fixed difficulty levels that mismatch the capabilities of the evaluated LLMs. These limitations lead to superficial assessments of LLM knowledge, thereby impeding targeted model optimization. To bridge this gap, we propose JudgeAgent, a knowledge-driven and dynamic evaluation framework for LLMs. To address the challenge of limited knowledge coverage, JudgeAgent leverages LLM agents equipped with context graphs to systematically traverse knowledge structures for question generation. Furthermore, to mitigate data contamination and difficulty mismatch, it adopts a difficulty-adaptive, multi-turn interview mechanism. Empirical results demonstrate that JudgeAgent enables more comprehensive evaluations and facilitates effective model iteration, highlighting the potential of this knowledge-driven and dynamic evaluation paradigm. The source code is available at https://github.com/DataArcTech/JudgeAgent.
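The two mechanisms named in the abstract can be illustrated with a toy sketch. All names, data structures, and the difficulty policy below are illustrative assumptions for exposition, not the paper's actual API: a breadth-first traversal stands in for context-graph-guided topic coverage, and a simple escalate/de-escalate rule stands in for the difficulty-adaptive, multi-turn interview.

```python
from collections import deque

# Hypothetical toy graph: concept -> related concepts (illustrative only).
context_graph = {
    "machine learning": ["supervised learning", "unsupervised learning"],
    "supervised learning": ["classification"],
    "unsupervised learning": ["clustering"],
    "classification": [],
    "clustering": [],
}

def traverse_topics(graph, start):
    """Breadth-first traversal so question topics cover the knowledge
    structure systematically rather than by random sampling."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def adaptive_interview(topics, answer_fn, levels=("easy", "medium", "hard")):
    """Multi-turn interview: raise the difficulty after a correct answer,
    lower it after an incorrect one, and record every turn."""
    level = 0  # start at the easiest level
    transcript = []
    for topic in topics:
        question = f"[{levels[level]}] Explain: {topic}"
        correct = answer_fn(question)  # stand-in for querying the evaluated LLM
        transcript.append((topic, levels[level], correct))
        level = min(level + 1, len(levels) - 1) if correct else max(level - 1, 0)
    return transcript

# Usage with a mock "model" that always answers correctly: the difficulty
# climbs from easy to hard and then stays there.
log = adaptive_interview(
    traverse_topics(context_graph, "machine learning"),
    answer_fn=lambda q: True,
)
```

Because each question's difficulty depends on the model's previous answers, the sequence of questions is generated on the fly rather than drawn from a fixed pool, which is what mitigates data contamination and difficulty mismatch relative to static benchmarks.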