Evaluating Language Models for Mathematics through Interactions

Katherine M. Collins, Albert Q. Jiang, Simon Frieder, Lionel Wong, Miri Zilka, Umang Bhatt, Thomas Lukasiewicz, Yuhuai Wu, Joshua B. Tenenbaum, William Hart, Timothy Gowers, Wenda Li, Adrian Weller, Mateja Jamnik

The standard methodology of evaluating large language models (LLMs) based on static pairs of inputs and outputs is insufficient for developing assistants: this kind of assessment fails to take into account the essential interactive element in their deployment, and therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants ranging from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analysing MathConverse, we derive a preliminary taxonomy of human behaviours and uncover that, despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, amongst other findings. Further, we identify useful scenarios and existing issues of GPT-4 in mathematical reasoning through a series of case studies contributed by expert mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty, respond well to user corrections, and are more interpretable and concise may constitute better assistants; interactive evaluation is a promising way to continually navigate the capability of these models; humans should be aware of language models' algebraic fallibility and, for that reason, discern where they should be used.
