Large Language Models (LLMs) are popular for their impressive abilities, but the need for model-specific fine-tuning or task-specific prompt engineering can hinder their generalization. We propose UPRISE (Universal Prompt Retrieval for Improving zero-Shot Evaluation), which tunes a lightweight and versatile retriever that automatically retrieves prompts for a given zero-shot task input. Specifically, we demonstrate universality in a cross-task and cross-model scenario: the retriever is tuned on a diverse set of tasks, but tested on unseen task types; we use a small frozen LLM, GPT-Neo-2.7B, for tuning the retriever, but test the retriever on different LLMs of much larger scales, such as BLOOM-7.1B, OPT-66B and GPT3-175B. Additionally, we show that UPRISE mitigates the hallucination problem in our experiments with ChatGPT, suggesting its potential to improve even the strongest LLMs. Our model and code are available at https://github.com/microsoft/LMOps.
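The retrieve-then-prompt flow described in the abstract can be illustrated with a minimal sketch. This is not the released UPRISE implementation: the bag-of-words `embed` function stands in for the tuned retriever's encoder, and the names `retrieve_prompts`, `build_llm_input`, and the example prompt pool are hypothetical, introduced only for illustration. The sketch shows the inference-time idea only: score each candidate prompt against the task input, keep the top-k, and prepend them to the input before it is passed to a frozen LLM.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in (assumption) for a tuned encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_prompts(task_input, prompt_pool, k=2):
    """Return the k prompts from the pool most similar to the task input."""
    query = embed(task_input)
    ranked = sorted(prompt_pool, key=lambda p: cosine(query, embed(p)), reverse=True)
    return ranked[:k]


def build_llm_input(task_input, prompt_pool, k=2):
    """Prepend the retrieved prompts to the task input for a frozen LLM."""
    prompts = retrieve_prompts(task_input, prompt_pool, k)
    return "\n\n".join(prompts + [task_input])


# Hypothetical prompt pool and zero-shot task input.
pool = [
    "Review: great movie Sentiment: positive",
    "Question: capital of France Answer: Paris",
    "Review: terrible movie Sentiment: negative",
]
task = "Review: boring movie Sentiment:"
llm_input = build_llm_input(task, pool, k=2)
```

In this sketch the scoring function is fixed. In the approach the abstract describes, the retriever is tuned with a small frozen LLM and then reused unchanged with larger LLMs, so only the retriever, not the LLM, would be swapped out for `embed` and `cosine` here.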