As machine intelligence evolves, the need to test and compare the problem-solving abilities of different AI models grows. However, current benchmarks are often simplistic, allowing models to perform uniformly well and making it difficult to distinguish their capabilities. Additionally, benchmarks typically rely on static question-answer pairs that models may memorize or guess. To address these limitations, we introduce Dynamic Intelligence Assessment (DIA), a novel methodology for testing AI models using dynamic question templates and improved metrics across multiple disciplines such as mathematics, cryptography, cybersecurity, and computer science. The accompanying dataset, DIA-Bench, contains a diverse collection of challenge templates with mutable parameters presented in various formats, including text, PDFs, compiled binaries, visual puzzles, and CTF-style cybersecurity challenges. Our framework introduces four new metrics to assess a model's reliability and confidence across multiple attempts. These metrics revealed that even simple questions are frequently answered incorrectly when posed in varying forms, exposing significant gaps in model reliability. Notably, API models such as GPT-4o often overestimated their mathematical capabilities, while ChatGPT-4o performed better thanks to effective tool use. In self-assessment, OpenAI's o1-mini showed the best judgement about which tasks it should attempt to solve. We evaluated 25 state-of-the-art LLMs with DIA-Bench, showing that current models struggle with complex tasks and often display unexpectedly low confidence, even on simpler questions. The DIA framework sets a new standard for assessing not only problem-solving ability but also a model's adaptive intelligence and capacity to recognize its own limitations. The dataset is publicly available on the project's page: https://github.com/DIA-Bench.
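To make the idea of dynamic question templates more concrete, the following is a minimal, hypothetical Python sketch: a template draws fresh parameters on every instantiation (so memorized static answers do not help), and a simple reliability score is computed over repeated attempts. The `QuestionTemplate` class, the modular-exponentiation example, and the `reliability` function are illustrative assumptions only, not the DIA framework's actual template format or its four metrics.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


# Hypothetical sketch of a dynamic question template: each instantiation
# draws fresh parameters, so a model cannot rely on memorized answers.
# This is NOT the DIA-Bench implementation, only an illustration of the idea.
@dataclass
class QuestionTemplate:
    render: Callable[[Dict], str]       # builds the question text from parameters
    solve: Callable[[Dict], str]        # computes the ground-truth answer
    sample_params: Callable[[], Dict]   # draws a new set of mutable parameters

    def instantiate(self) -> Tuple[str, str]:
        params = self.sample_params()
        return self.render(params), self.solve(params)


# Example template in the spirit of the mathematics/cryptography challenges:
# modular exponentiation with randomly drawn operands (values invented here).
mod_exp = QuestionTemplate(
    render=lambda p: f"Compute {p['a']}^{p['b']} mod {p['m']}.",
    solve=lambda p: str(pow(p["a"], p["b"], p["m"])),
    sample_params=lambda: {
        "a": random.randint(2, 50),
        "b": random.randint(10, 200),
        "m": random.randint(100, 10_000),
    },
)


def reliability(model_answers: List[str], correct_answers: List[str]) -> float:
    """Illustrative reliability score: fraction of attempts answered correctly
    across independently instantiated variants of the same template."""
    assert len(model_answers) == len(correct_answers)
    hits = sum(a.strip() == c for a, c in zip(model_answers, correct_answers))
    return hits / len(correct_answers)


if __name__ == "__main__":
    # Generate a few independent instances of the same template.
    for _ in range(3):
        question, answer = mod_exp.instantiate()
        print(question, "->", answer)
```

In this sketch, scoring a model over several fresh instances of one template (rather than one fixed question) is what separates genuine problem-solving ability from answers recalled or guessed for a static benchmark item.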