Large Language Models (LLMs) are popular for their impressive abilities, but the need for model-specific fine-tuning or task-specific prompt engineering can hinder their generalization. We propose UPRISE (Universal Prompt Retrieval for Improving zero-Shot Evaluation), which tunes a lightweight and versatile retriever that automatically retrieves prompts for a given zero-shot task input. Specifically, we demonstrate universality in a cross-task and cross-model scenario: the retriever is tuned on a diverse set of tasks, but tested on unseen task types; we use a small frozen LLM, GPT-Neo-2.7B, for tuning the retriever, but test the retriever on different LLMs of much larger scales, such as BLOOM-7.1B, OPT-66B and GPT3-175B. Additionally, we show that UPRISE mitigates the hallucination problem in our experiments with ChatGPT, suggesting its potential to improve even the strongest LLMs. Our model and code are available at https://github.com/microsoft/LMOps.
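The retrieve-then-prepend flow described above can be illustrated with a minimal sketch. It is not the UPRISE implementation: the abstract does not specify the retriever architecture or the prompt pool, so the bag-of-words `embed` function, the cosine scoring, the `PROMPT_POOL` contents, and all function names here are hypothetical stand-ins for a tuned retriever. The frozen LLM is left out entirely; the sketch only builds the text that would be passed to it.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for a learned encoder: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_prompts(task_input: str, prompt_pool: List[str], k: int = 2) -> List[str]:
    """Return the k pool entries most similar to the task input."""
    query = embed(task_input)
    ranked = sorted(prompt_pool, key=lambda p: cosine(query, embed(p)), reverse=True)
    return ranked[:k]


def build_llm_input(task_input: str, prompt_pool: List[str], k: int = 2) -> str:
    """Prepend the retrieved prompts to the zero-shot input for a frozen LLM."""
    retrieved = retrieve_prompts(task_input, prompt_pool, k)
    return "\n\n".join(retrieved + [task_input])


# Hypothetical prompt pool; a real pool would be drawn from many training tasks.
PROMPT_POOL = [
    "Review: the movie was wonderful. Sentiment: positive",
    "Question: what is the capital of France? Answer: Paris",
    "Review: the food was terrible. Sentiment: negative",
]

if __name__ == "__main__":
    print(build_llm_input("Review: the plot was dull. Sentiment:", PROMPT_POOL))
```

Because the retriever only produces text that is prepended to the input, the same retrieved prompts can in principle be handed to any LLM without touching its weights, which is the property the cross-model evaluation relies on.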