Emergent chain-of-thought (CoT) reasoning capabilities promise to improve the performance and explainability of large language models (LLMs). However, it remains uncertain how prompting strategies formulated for previous model generations generalize to new model generations and different datasets. In this small-scale study, we compare the performance of a range of zero-shot prompts for inducing CoT reasoning across six recently released LLMs (davinci-002, davinci-003, GPT-3.5-turbo, GPT-4, Flan-T5-xxl, and Cohere command-xlarge) on a mixture of six question-answering datasets, including datasets from scientific and medical domains. We find that a CoT prompt previously discovered through automated prompt discovery performs robustly across experimental conditions and produces the best results when applied to the state-of-the-art model GPT-4.