The performance of Large Language Models (LLMs) on existing reasoning benchmarks has improved considerably in recent years. In response, we present JEEBench, a considerably more challenging benchmark dataset for evaluating the problem-solving abilities of LLMs. We curate 450 challenging pre-engineering mathematics, physics, and chemistry problems from the IIT JEE-Advanced exam. Solving the problems in this benchmark requires long-horizon reasoning on top of deep in-domain knowledge. Our evaluation of the GPT series of models reveals that although performance improves with newer models, the best model, GPT-4, scores less than 40 percent even with techniques such as Self-Consistency and Chain-of-Thought prompting. Our analysis demonstrates that errors in algebraic manipulation and failure to retrieve relevant domain-specific concepts are the primary contributors to GPT-4's low performance. Given the challenging nature of the benchmark, we hope that it can guide future research in problem solving using LLMs. Our code and dataset are available here.