The performance of large language models (LLMs) on existing reasoning benchmarks has improved significantly in recent years. In response, we present JEEBench, a considerably more challenging benchmark dataset for evaluating the problem-solving abilities of LLMs. We curate 515 challenging pre-engineering mathematics, physics and chemistry problems from the highly competitive IIT JEE-Advanced exam. Long-horizon reasoning on top of deep in-domain knowledge is essential for solving problems in this benchmark. Our evaluation of various open-source and proprietary models reveals that the highest performance, even after using techniques like self-consistency, self-refinement and chain-of-thought prompting, is less than 40%. The typical failure modes of GPT-4, the best model, are errors in algebraic manipulation, difficulty accurately grounding abstract concepts in mathematical equations, and failure to retrieve relevant domain-specific concepts. We also observe that with prompting alone, GPT-4 is unable to assess the risk introduced by negative marking for incorrect answers. To address this, we develop a post-hoc confidence-thresholding method over self-consistency, which enables effective response selection. We hope that our challenging benchmark will guide future research in problem solving using LLMs.
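The idea of confidence thresholding over self-consistency can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how confidence is estimated or how the threshold is chosen. The sketch assumes that confidence is the vote share of the most frequent answer among the sampled responses, and that a correct answer earns +3, an incorrect one -1, and a skipped question 0. The function names `select_answer` and `score` and the marking values are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def select_answer(samples: Sequence[str], threshold: float) -> Optional[str]:
    """Return the majority answer if its vote share reaches `threshold`.

    `samples` holds the final answers extracted from several sampled
    responses (self-consistency). Returns None to abstain when the
    majority is too weak to justify the risk of a penalty.
    """
    if not samples:
        return None
    answer, votes = Counter(samples).most_common(1)[0]
    confidence = votes / len(samples)  # assumed confidence estimate: vote share
    return answer if confidence >= threshold else None


def score(prediction: Optional[str], gold: str,
          reward: float = 3.0, penalty: float = -1.0) -> float:
    """Score one question under negative marking (assumed +3 / -1 / 0 scheme)."""
    if prediction is None:
        return 0.0  # abstaining costs nothing
    return reward if prediction == gold else penalty


if __name__ == "__main__":
    sampled = ["B", "B", "A", "B", "C", "B", "B", "D"]  # 5 of 8 votes for "B"
    print(select_answer(sampled, threshold=0.5))   # answers "B"
    print(select_answer(sampled, threshold=0.75))  # abstains (None)
```

In a setup like this, the threshold would typically be tuned on held-out questions so that the expected gain from answering outweighs the expected penalty.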