Large language models (LLMs) have achieved unprecedented performance in various applications, yet evaluating them is still challenging. Existing benchmarks are either manually constructed or automatic, but lack the ability to evaluate the thought process of LLMs with arbitrary complexity. We contend that utilizing existing relational databases based on the entity-relationship (ER) model is a promising approach for constructing benchmarks, as they contain structured knowledge that can be used to question LLMs. Unlike knowledge graphs, which are also used to evaluate LLMs, relational databases have integrity constraints that can be used to better construct complex, in-depth questions and verify answers: (1) functional dependencies can be used to pinpoint critical keywords that an LLM must know to properly answer a given question containing certain attribute values; and (2) foreign key constraints can be used to join relations and construct multi-hop questions, which can be arbitrarily long and used to debug intermediate answers. We thus propose ERBench, which uses these integrity constraints to convert any database into an LLM benchmark. ERBench supports continuous evaluation as databases change, multimodal questions, and various prompt engineering techniques. In our experiments, we construct LLM benchmarks using databases of multiple domains and make an extensive comparison of contemporary LLMs. We show how ERBench can properly evaluate any LLM by not only checking for answer correctness, but also effectively verifying the rationales by looking for the right keywords.
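The foreign-key mechanism described above can be illustrated with a minimal sketch. The schema below (a `Movie` table whose `director` column references a `Director` table) is hypothetical, chosen only to show how joining along a foreign key yields a 2-hop question together with a verifiable gold answer and the intermediate keyword the rationale must contain; ERBench's actual schemas and templates may differ.

```python
# Hypothetical illustration: build a 2-hop question from a foreign-key join.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Director (name TEXT PRIMARY KEY, birth_year INTEGER)")
cur.execute("""CREATE TABLE Movie (
    title TEXT PRIMARY KEY,
    director TEXT REFERENCES Director(name),
    year INTEGER)""")
cur.execute("INSERT INTO Director VALUES ('Christopher Nolan', 1970)")
cur.execute("INSERT INTO Movie VALUES ('Inception', 'Christopher Nolan', 2010)")

# Join along the foreign key Movie.director -> Director.name.
# The joined row gives the multi-hop question, the expected rationale
# keyword (the intermediate entity), and the gold answer to verify against.
title, keyword, answer = cur.execute("""
    SELECT m.title, d.name, d.birth_year
    FROM Movie m JOIN Director d ON m.director = d.name""").fetchone()

question = f"In what year was the director of the movie '{title}' born?"
print(question)  # 2-hop question posed to the LLM
print(keyword)   # keyword the LLM's rationale should mention
print(answer)    # gold answer checked against the database
```

Because the answer and the intermediate keyword both come from the database itself, the same join that generates the question also verifies the LLM's final answer and its reasoning step, and longer foreign-key chains extend this to arbitrarily many hops.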