Today's large language models (LLMs) can solve challenging question-answering tasks, and prompt engineering techniques such as chain-of-thought (CoT) have gained attention for improving the explainability and correctness of their outputs. However, models require significant time to generate answers augmented with lengthy reasoning details. To address this issue, this paper analyzes the impact of output length on LLM inference pipelines and proposes novel metrics to evaluate outputs in terms of their \textit{correct conciseness}. It also examines the effect of controlling output length through a refined prompt engineering strategy, Constrained-CoT (CCoT), which explicitly encourages the model to limit the length of its reasoning. Experiments on pre-trained LLMs demonstrate the benefit of the proposed metrics and the effectiveness of CCoT across different models. For instance, constraining the reasoning of LLaMA2-70b to 100 words improves its accuracy on the GSM8K dataset from 36.01\% (CoT) to 41.07\% (CCoT), while reducing the average output length by 28 words.
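To illustrate the idea behind CCoT, a minimal sketch of such a prompt builder is shown below. The exact constraint phrasing and the function name `ccot_prompt` are assumptions for illustration, not the paper's verbatim prompt:

```python
def ccot_prompt(question: str, word_limit: int = 100) -> str:
    """Build a Constrained-CoT style prompt: a standard chain-of-thought
    instruction augmented with an explicit limit on the reasoning length.
    The wording here is illustrative; the paper's actual prompt may differ."""
    return (
        f"{question}\n"
        "Let's think step by step "
        f"and limit the answer to {word_limit} words."
    )

# Example: a GSM8K-style arithmetic question with a 100-word constraint.
prompt = ccot_prompt("If a pen costs 3 dollars, how much do 7 pens cost?")
```

The only difference from a plain CoT prompt is the appended length constraint, which is what nudges the model toward concise reasoning.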