Large Language Models (LLMs) are frequently used for multi-faceted language generation and evaluation tasks that involve satisfying intricate user constraints or taking into account multiple aspects and criteria. However, their performance can fall short because the model lacks coherence and cannot plan and decompose the problem. We propose Branch-Solve-Merge (BSM), a Large Language Model program (Schlag et al., 2023) for tackling such challenging natural language tasks. It consists of branch, solve, and merge modules, each parameterized with specific prompts to the base LLM. Together, these three modules plan a decomposition of the task into multiple parallel sub-tasks, solve them independently, and fuse the sub-task solutions. We apply our method to LLM response evaluation and constrained text generation, and evaluate its effectiveness with multiple LLMs, including Vicuna, LLaMA-2-chat, and GPT-4. BSM improves evaluation correctness and consistency for each LLM: it raises human-LLM agreement by up to 26%, reduces length and pairwise position biases by up to 50%, and allows LLaMA-2-chat to match or outperform GPT-4 on most domains. On the constrained story generation task, BSM improves the coherence of the stories while also improving constraint satisfaction by 12%.
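The branch, solve, and merge steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `llm` callable (prompt in, completion out), the function names, and all prompt wordings are hypothetical placeholders chosen for illustration, and the sub-tasks are solved in a simple loop rather than in parallel.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


def branch(llm: LLM, task: str) -> List[str]:
    """Ask the model for a plan and read one sub-task per non-empty line."""
    plan = llm(f"List the independent sub-tasks needed to solve:\n{task}")
    subtasks = []
    for line in plan.splitlines():
        item = line.strip().lstrip("-").strip()  # drop bullet dashes and padding
        if item:
            subtasks.append(item)
    return subtasks


def solve(llm: LLM, task: str, subtask: str) -> str:
    """Solve a single sub-task without seeing the other sub-tasks."""
    return llm(f"Task:\n{task}\n\nSolve only this sub-task:\n{subtask}")


def merge(llm: LLM, task: str, subtasks: List[str], solutions: List[str]) -> str:
    """Fuse the per-sub-task solutions into one final answer."""
    parts = "\n".join(f"[{s}] {sol}" for s, sol in zip(subtasks, solutions))
    return llm(f"Task:\n{task}\n\nCombine these partial solutions into one answer:\n{parts}")


def branch_solve_merge(llm: LLM, task: str) -> str:
    """Run the three modules in sequence: plan, solve each part, fuse."""
    subtasks = branch(llm, task)
    solutions = [solve(llm, task, s) for s in subtasks]  # independent of each other
    return merge(llm, task, subtasks, solutions)
```

For response evaluation, the sub-tasks would plausibly be evaluation criteria and the merge step would combine per-criterion judgments; for constrained generation, the sub-tasks would be groups of constraints. Both readings are assumptions drawn from the description above, not details confirmed here.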