As machine intelligence evolves, the need to test and compare the problem-solving abilities of different AI models grows. However, current benchmarks are often simplistic, allowing models to perform uniformly well and making it difficult to distinguish their capabilities. Additionally, benchmarks typically rely on static question-answer pairs that models may memorize or guess. To address these limitations, we introduce Dynamic Intelligence Assessment (DIA), a novel methodology for testing AI models using dynamic question templates and improved metrics across multiple disciplines such as mathematics, cryptography, cybersecurity, and computer science. The accompanying dataset, DIA-Bench, contains a diverse collection of challenge templates with mutable parameters presented in various formats, including text, PDFs, compiled binaries, visual puzzles, and CTF-style cybersecurity challenges. Our framework introduces four new metrics to assess a model's reliability and confidence across multiple attempts. These metrics revealed that even simple questions are frequently answered incorrectly when posed in varying forms, exposing significant gaps in model reliability. Notably, API models such as GPT-4o often overestimated their mathematical capabilities, while ChatGPT-4o performed better thanks to effective tool use. In self-assessment, OpenAI's o1-mini showed the best judgement about which tasks it should attempt to solve. We evaluated 25 state-of-the-art LLMs with DIA-Bench, showing that current models struggle with complex tasks and often display unexpectedly low confidence, even on simpler questions. The DIA framework sets a new standard for assessing not only problem-solving ability but also a model's adaptive intelligence and capacity to recognize its own limitations. The dataset is publicly available on the project's page: https://github.com/DIA-Bench.
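To make the idea of dynamic question templates more concrete, the following is a minimal, hypothetical Python sketch: a template draws fresh parameters on every instantiation (so memorized static answers do not help), and a simple reliability score is computed over repeated attempts. The `QuestionTemplate` class, the modular-exponentiation example, and the `reliability` function are illustrative assumptions only, not the DIA framework's actual template format or its four metrics.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


# Hypothetical sketch of a dynamic question template: each instantiation
# draws fresh parameters, so a model cannot rely on memorized answers.
# This is NOT the DIA-Bench implementation, only an illustration of the idea.
@dataclass
class QuestionTemplate:
    render: Callable[[Dict], str]       # builds the question text from parameters
    solve: Callable[[Dict], str]        # computes the ground-truth answer
    sample_params: Callable[[], Dict]   # draws a new set of mutable parameters

    def instantiate(self) -> Tuple[str, str]:
        params = self.sample_params()
        return self.render(params), self.solve(params)


# Example template in the spirit of the mathematics/cryptography challenges:
# modular exponentiation with randomly drawn operands (values invented here).
mod_exp = QuestionTemplate(
    render=lambda p: f"Compute {p['a']}^{p['b']} mod {p['m']}.",
    solve=lambda p: str(pow(p["a"], p["b"], p["m"])),
    sample_params=lambda: {
        "a": random.randint(2, 50),
        "b": random.randint(10, 200),
        "m": random.randint(100, 10_000),
    },
)


def reliability(model_answers: List[str], correct_answers: List[str]) -> float:
    """Illustrative reliability score: fraction of attempts answered correctly
    across independently instantiated variants of the same template."""
    assert len(model_answers) == len(correct_answers)
    hits = sum(a.strip() == c for a, c in zip(model_answers, correct_answers))
    return hits / len(correct_answers)


if __name__ == "__main__":
    # Generate a few independent instances of the same template.
    for _ in range(3):
        question, answer = mod_exp.instantiate()
        print(question, "->", answer)
```

In this sketch, scoring a model over several fresh instances of one template (rather than one fixed question) is what separates genuine problem-solving ability from answers recalled or guessed for a static benchmark item.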