Recent work has shown that prompting language models with code-like representations of natural language improves performance on structured reasoning tasks. However, such tasks make up only a small subset of natural language tasks. In this work, we ask whether code prompting is the preferred way of interacting with language models in general. We compare code and text prompts across three popular GPT models (davinci, code-davinci-002, and text-davinci-002) on a broader selection of tasks (e.g., QA, sentiment, summarization) and find that, with few exceptions, code prompts do not consistently outperform text prompts. Furthermore, we show that the style of code prompt has a large effect on performance for some but not all tasks, and that fine-tuning on text instructions leads to better relative performance of code prompts.
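To make the distinction between the two prompt types concrete, the following is a minimal sketch of how a single sentiment example might be rendered as a text prompt and as a code prompt. The function names, the instruction wording, and the Python-assignment style of the code prompt are all illustrative assumptions; they are not the templates used in the experiments, and the abstract notes that the choice of code-prompt style can itself affect performance.

```python
def build_text_prompt(review: str) -> str:
    """Render the task as a plain natural-language instruction (hypothetical template)."""
    return (
        "Classify the sentiment of the following review as positive or negative.\n"
        f"Review: {review}\n"
        "Sentiment:"
    )


def build_code_prompt(review: str) -> str:
    """Render the same task as Python-like code for the model to complete.

    This is one assumed style among many possible code-prompt styles.
    """
    return (
        "# Classify the sentiment of the review as 'positive' or 'negative'.\n"
        f"review = {review!r}\n"
        "sentiment = "
    )


if __name__ == "__main__":
    example = "The plot was thin, but the acting was superb."
    print(build_text_prompt(example))
    print()
    print(build_code_prompt(example))
```

Both functions encode the same input and the same question; only the surface form differs, which is the variable the comparison between code and text prompts isolates.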