Large Language Models (LLMs) are popular for their impressive abilities, but the need for model-specific fine-tuning or task-specific prompt engineering can hinder their generalization. We propose UPRISE (Universal Prompt Retrieval for Improving zero-Shot Evaluation), which tunes a lightweight and versatile retriever that automatically retrieves prompts for a given zero-shot task input. Specifically, we demonstrate universality in a cross-task and cross-model scenario: the retriever is tuned on a diverse set of tasks, but tested on unseen task types; we use a small frozen LLM, GPT-Neo-2.7B, for tuning the retriever, but test the retriever on different LLMs of much larger scales, such as BLOOM-7.1B, OPT-66B and GPT3-175B. Additionally, we show that UPRISE mitigates the hallucination problem in our experiments with ChatGPT, suggesting its potential to improve even the strongest LLMs. Our model and code are available at https://github.com/microsoft/LMOps.
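The retrieve-then-prompt flow described in the abstract can be illustrated with a minimal sketch. This is not the UPRISE implementation: the tuned retriever is replaced here by a toy bag-of-words cosine similarity, and the names `embed`, `cosine`, `retrieve_prompts`, `build_llm_input`, and the contents of `prompt_pool` are hypothetical, introduced only for illustration. The sketch only shows the overall shape: score each candidate prompt against the task input, keep the top-k, and prepend them to the input before it is passed to a frozen LLM.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a tuned encoder (assumption): a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_prompts(task_input, prompt_pool, k=2):
    """Return the k prompts from the pool most similar to the task input."""
    query = embed(task_input)
    ranked = sorted(prompt_pool, key=lambda p: cosine(query, embed(p)), reverse=True)
    return ranked[:k]


def build_llm_input(task_input, prompt_pool, k=2):
    """Prepend the retrieved prompts to the task input; the LLM itself stays frozen."""
    retrieved = retrieve_prompts(task_input, prompt_pool, k)
    return "\n\n".join(retrieved + [task_input])


# Hypothetical prompt pool, made up for this example.
prompt_pool = [
    "Review: the movie was wonderful. Sentiment: positive",
    "Question: what is the capital of France? Answer: Paris",
    "Review: the food was terrible. Sentiment: negative",
]

if __name__ == "__main__":
    print(build_llm_input("Review: the movie was boring. Sentiment:", prompt_pool))
```

Because the retriever is separate from the LLM in this arrangement, the string returned by `build_llm_input` could be sent to any model, which is the property the cross-model evaluation relies on.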