This study critically evaluates the efficacy of prompting methods in enhancing the mathematical reasoning capabilities of large language models (LLMs). The investigation examines three prescriptive prompting methods - simple, persona, and conversational prompting - known to be effective at enhancing LLM performance on linguistic tasks. We conduct this analysis on OpenAI's LLM chatbot, ChatGPT-3.5, using extensive problem sets from the MATH, GSM8K, and MMLU datasets, which together span a broad spectrum of mathematical challenges. A grading script adapted to each dataset measures the effect of these prompting interventions on the model's mathematical reasoning performance. Contrary to expectations, our empirical analysis reveals that none of the investigated methods consistently improves on ChatGPT-3.5's baseline performance, and some cause significant degradation. Our findings suggest that prompting strategies do not necessarily generalize to new domains; in this study, they fail to enhance mathematical performance.