Large language models (LLMs) have achieved unprecedented performance in various applications, yet evaluating them is still challenging. Existing benchmarks are either manually constructed or automatic, but lack the ability to evaluate the thought process of LLMs with arbitrary complexity. We contend that utilizing existing relational databases based on the entity-relationship (ER) model is a promising approach for constructing benchmarks, as they contain structured knowledge that can be used to question LLMs. Unlike knowledge graphs, which are also used to evaluate LLMs, relational databases have integrity constraints that can be used to better construct complex, in-depth questions and verify answers: (1) functional dependencies can be used to pinpoint critical keywords that an LLM must know to properly answer a given question containing certain attribute values; and (2) foreign key constraints can be used to join relations and construct multi-hop questions, which can be arbitrarily long and used to debug intermediate answers. We thus propose ERBench, which uses these integrity constraints to convert any database into an LLM benchmark. ERBench supports continuous evaluation as databases change, multimodal questions, and various prompt engineering techniques. In our experiments, we construct LLM benchmarks using databases of multiple domains and make an extensive comparison of contemporary LLMs. We show how ERBench can properly evaluate any LLM by not only checking for answer correctness, but also effectively verifying the rationales by looking for the right keywords.
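The foreign-key mechanism described above can be illustrated with a minimal sketch. The schema below (a `Movie` table whose `director` column references a `Director` table) is hypothetical, chosen only to show how joining along a foreign key yields a 2-hop question together with a verifiable gold answer and the intermediate keyword the rationale must contain; ERBench's actual schemas and templates may differ.

```python
# Hypothetical illustration: build a 2-hop question from a foreign-key join.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Director (name TEXT PRIMARY KEY, birth_year INTEGER)")
cur.execute("""CREATE TABLE Movie (
    title TEXT PRIMARY KEY,
    director TEXT REFERENCES Director(name),
    year INTEGER)""")
cur.execute("INSERT INTO Director VALUES ('Christopher Nolan', 1970)")
cur.execute("INSERT INTO Movie VALUES ('Inception', 'Christopher Nolan', 2010)")

# Join along the foreign key Movie.director -> Director.name.
# The joined row gives the multi-hop question, the expected rationale
# keyword (the intermediate entity), and the gold answer to verify against.
title, keyword, answer = cur.execute("""
    SELECT m.title, d.name, d.birth_year
    FROM Movie m JOIN Director d ON m.director = d.name""").fetchone()

question = f"In what year was the director of the movie '{title}' born?"
print(question)  # 2-hop question posed to the LLM
print(keyword)   # keyword the LLM's rationale should mention
print(answer)    # gold answer checked against the database
```

Because the answer and the intermediate keyword both come from the database itself, the same join that generates the question also verifies the LLM's final answer and its reasoning step, and longer foreign-key chains extend this to arbitrarily many hops.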