Large Language Models (LLMs) have demonstrated remarkable problem-solving ability, including on basic mathematics. However, their efficacy is highly contingent on how the prompt is formulated. This study quantifies the influence of incorporating "positive thinking" into the system message of the prompt, then compares it with systematic prompt optimization. We assess the performance of 60 combinations of system message snippets, tested with and without Chain of Thought prompting, across three models ranging from 7 to 70 billion parameters, on the GSM8K dataset. Our findings reveal that results do not universally generalize across models. In most instances, including "positive thinking" prompts improved model performance. A notable exception was Llama2-70B without Chain of Thought, for which the optimal system message was none at all. Given the combinatorial complexity, and thus computation time, of hand-tuning prompts for large black-box models, we then compared the performance of the best "positive thinking" prompt against the output of systematic prompt optimization. We show that an automated prompt optimizer is the most effective method for enhancing performance, even when working with smaller open-source models. Additionally, the highest-scoring automatically optimized prompt is far more peculiar than expected.
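To make the size of the hand-tuning search space concrete, the following is a minimal sketch of how a grid of system messages might be enumerated and crossed with the Chain of Thought setting. Everything in it is an assumption for illustration: the snippet texts are hypothetical placeholders, the opener/task/closer split (5 x 3 x 4) is simply one factorization that yields 60 combinations, and the exact-match scorer is a generic stand-in rather than the study's evaluation code.

```python
from itertools import product

# Hypothetical placeholder snippets; the abstract does not list the real ones.
# An empty string means "no snippet in this slot".
OPENERS = [
    "",
    "You are a careful math tutor.",
    "You are brilliant at arithmetic.",
    "You are an expert problem solver.",
    "You are a patient teacher.",
]
TASKS = [
    "",
    "Solve the following math problem.",
    "Answer the math question below.",
]
CLOSERS = [
    "",
    "This will be fun!",
    "Take a deep breath.",
    "I believe in you.",
]

# A commonly used Chain of Thought cue (assumed here, not taken from the source).
COT_CUE = "Let's think step by step."


def build_system_messages():
    """Return every opener/task/closer combination as one system message."""
    messages = []
    for parts in product(OPENERS, TASKS, CLOSERS):
        # Skip empty slots so the all-empty combination is "no system message".
        messages.append(" ".join(p for p in parts if p))
    return messages


def build_user_prompt(question, use_cot):
    """Append the Chain of Thought cue to the question when requested."""
    return f"{question}\n{COT_CUE}" if use_cot else question


def build_conditions():
    """Pair each system message with Chain of Thought off and on."""
    return [
        (message, use_cot)
        for message in build_system_messages()
        for use_cot in (False, True)
    ]


def exact_match_accuracy(predictions, references):
    """Fraction of predicted final answers equal to the reference answers."""
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)
```

Under these assumptions the grid has 60 system messages and 120 conditions per model, each of which requires a full pass over the evaluation set. That cost is what motivates comparing hand-tuned prompts against an automated optimizer.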