Large language models (LLMs) show impressive abilities via few-shot prompting. Commercial APIs such as OpenAI's GPT-3 further increase their use in real-world language applications. However, the crucial problem of how to improve the reliability of GPT-3 remains under-explored. Reliability is a broad and vaguely defined term, so we decompose it into four facets that correspond to the existing framework of ML safety and are widely recognized as important: generalizability, social biases, calibration, and factuality. Our core contribution is to establish simple and effective prompts that improve GPT-3's reliability as it: 1) generalizes out-of-distribution, 2) balances demographic distribution and uses natural language instructions to reduce social biases, 3) calibrates output probabilities, and 4) updates the LLM's factual knowledge and reasoning chains. With appropriate prompts, GPT-3 is more reliable than smaller-scale supervised models on all of these facets. We release all processed datasets, evaluation scripts, and model predictions. Our systematic empirical study offers new insights into the reliability of prompting LLMs and, more importantly, our prompting strategies can help practitioners use LLMs like GPT-3 more reliably.