Math Word Problems (MWPs) are crucial for evaluating the capability of Large Language Models (LLMs), with current research primarily focusing on questions with concise contexts. However, as real-world math problems often involve complex circumstances, LLMs' ability to solve long MWPs is vital for their applications in these scenarios, yet remains under-explored. This study pioneers the exploration of Context Length Generalizability (CoLeG), the ability of LLMs to solve long MWPs. We introduce Extended Grade-School Math (E-GSM), a collection of MWPs with lengthy narratives. Two novel metrics are proposed to assess the efficacy and resilience of LLMs in solving these problems. Our examination of existing zero-shot prompting techniques and both proprietary and open-source LLMs reveals a general deficiency in CoLeG. To alleviate these challenges, we propose distinct approaches for different categories of LLMs. For proprietary LLMs, a new instructional prompt is proposed to mitigate the influence of long context. For open-source LLMs, a new data augmentation task is developed to improve CoLeG. Our comprehensive results demonstrate the effectiveness of our proposed methods, showing not only improved performance on E-GSM but also generalizability across several other MWP benchmarks. Our findings pave the way for future research in employing LLMs for complex, real-world applications, offering practical solutions to current limitations and opening avenues for further exploration of model generalizability and training methodologies.