Large language models (LLMs) excel at new tasks without additional training when given natural language prompts that demonstrate how the task should be performed. Prompt ensemble methods harness the knowledge of LLMs more comprehensively, mitigating the biases and errors of individual prompts and further improving performance. However, more prompts do not necessarily lead to better results, and not all prompts are beneficial: a few high-quality prompts often outperform many low-quality ones. Yet no suitable method exists for evaluating how individual prompts affect the final results. In this paper, we use the Shapley value to fairly quantify the contribution of each prompt, which helps identify beneficial and detrimental prompts and can guide prompt valuation in data markets. Through extensive experiments with various ensemble methods and utility functions on diverse tasks, we validate that the Shapley value effectively distinguishes and quantifies the contribution of each prompt.
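For reference, a minimal sketch of the quantity involved: the standard Shapley value treats each prompt as a player in a cooperative game. With $N$ the set of $n$ prompts and $v(\cdot)$ a utility function (for illustration, assume ensemble accuracy on a validation set), the contribution of prompt $i$ is

$$
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr).
$$

Since this sum ranges over all $2^{n-1}$ subsets, exact computation is exponential in the number of prompts; in practice it is commonly approximated, for example by Monte Carlo sampling over random permutations of the prompts.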