Large language models (LLMs) can solve problems step by step. While this chain-of-thought (CoT) reasoning boosts LLMs' performance, it is unclear whether LLMs \textit{know} when to use CoT and whether CoT is always necessary to answer a question. This paper shows that LLMs tend to generate redundant calculations and reasoning on GSM8K-Zero, a manually constructed math QA dataset. GSM8K-Zero is constructed so that its questions can be answered without any calculation, yet LLMs, including Llama-2 models and Claude-2, tend to generate lengthy and unnecessary calculations when answering them. We also conduct experiments to explain why LLMs generate redundant calculations and reasoning. GSM8K-Zero is publicly available at https://github.com/d223302/Over-Reasoning-of-LLMs and https://huggingface.co/datasets/dcml0714/GSM8K-Zero.