We study the performance of ChatGPT, a commercially available large language model (LLM), on math word problems (MWPs) from the DRAW-1K dataset. To our knowledge, this is the first independent evaluation of ChatGPT. We found that ChatGPT's performance changes dramatically based on the requirement to show its work, failing 20% of the time when it provides work compared with 84% when it does not. Further, several factors about MWPs relating to the number of unknowns and the number of operations lead to a higher probability of failure when compared with the prior. In particular, we note that (across all experiments) the probability of failure increases linearly with the number of addition and subtraction operations. We have also released the dataset of ChatGPT's responses to the MWPs to support further work on the characterization of LLM performance, and we present baseline machine learning models to predict whether ChatGPT can correctly answer an MWP.
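The comparison of a conditional failure probability against the prior, as described in the abstract, can be illustrated with a minimal sketch. It assumes each record is a pair of (number of addition and subtraction operations, whether the answer was wrong). The function name `failure_rates` and the toy records are hypothetical and are not taken from the paper or its released dataset.

```python
from collections import defaultdict


def failure_rates(records):
    """Return (prior, conditional) failure rates.

    records: iterable of (num_add_sub_ops, failed) pairs, where `failed`
    is True when the model's answer was wrong. This layout is an
    assumption made for illustration only.
    """
    records = list(records)
    # Prior: overall fraction of failed problems.
    prior = sum(failed for _, failed in records) / len(records)

    # Tally failures and totals per operation count.
    counts = defaultdict(lambda: [0, 0])  # ops -> [failures, total]
    for ops, failed in records:
        counts[ops][0] += int(failed)
        counts[ops][1] += 1

    # Conditional: failure fraction given the number of operations.
    conditional = {ops: f / n for ops, (f, n) in sorted(counts.items())}
    return prior, conditional


# Hypothetical toy records, NOT results from the paper.
toy = [(0, False), (0, False), (1, False), (1, True), (2, True), (2, True)]
prior, conditional = failure_rates(toy)
```

A conditional rate above the prior for a given operation count would indicate that problems of that kind fail more often than average, which is the type of comparison the abstract refers to.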