We present GSM-MC, a multiple-choice (MC) dataset constructed by collecting answers and incorrect predictions on GSM8K from 60 open-source models. Through extensive experiments, we show that LLMs' performance on the MC version of this popular benchmark is strongly correlated with their performance on the original version and is quite robust to distractor choices and option orders, while the evaluation time is reduced by a factor of up to 30. Following similar procedures, we introduce MATH-MC, constructed from MATH, and PythonIO, a new program reasoning MC dataset constructed from HumanEval and MBPP. Experimental results indicate that LLMs' performance on these MC benchmarks leaves much room for improvement. Our data and code are available at https://github.com/Geralt-Targaryen/MC-Evaluation.
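As a rough illustration of the construction idea described above (a correct answer combined with distractors drawn from models' incorrect predictions), the following minimal Python sketch assembles a single multiple-choice item. The function name `build_mc_item`, the default of four options, the uniform sampling of distractors, and the letter labels are all assumptions made for illustration; the actual procedure is the one in the authors' repository.

```python
import random


def build_mc_item(question, gold, wrong_predictions, num_options=4, seed=0):
    """Assemble one MC item from a gold answer and a pool of incorrect predictions.

    Hypothetical helper: option count and sampling strategy are assumptions.
    """
    rng = random.Random(seed)
    # Deduplicate the pool and drop any prediction that equals the gold answer.
    pool = sorted({p for p in wrong_predictions if p != gold})
    if len(pool) < num_options - 1:
        raise ValueError("not enough distinct distractors for this question")
    # Sample distractors, add the gold answer, and shuffle the option order.
    options = rng.sample(pool, num_options - 1) + [gold]
    rng.shuffle(options)
    labels = "ABCDEFGH"[:num_options]
    return {
        "question": question,
        "options": dict(zip(labels, options)),
        "answer": labels[options.index(gold)],
    }


item = build_mc_item(
    "A made-up word problem whose answer is 18.",
    gold="18",
    wrong_predictions=["16", "20", "18", "24", "16"],
)
```

Varying `seed` changes both which distractors are drawn and where the correct option lands, which is one simple way to probe sensitivity to distractor choice and option order.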