How-to procedures, such as how to plant a garden, are used by millions of people, but they sometimes need customizing to meet a user's specific needs, e.g., planting a garden without pesticides. Our goal is to measure and improve an LLM's ability to perform such customization. We test several simple multi-LLM-agent architectures for customization, as well as an end-to-end LLM, on a new evaluation set, CustomPlans, of over 200 WikiHow procedures, each paired with a customization need. We find that a simple architecture with two LLM agents used sequentially performs best: one agent edits a generic how-to procedure and the other verifies its executability. This architecture significantly outperforms an end-to-end prompted LLM (by 10.5% absolute). This suggests that LLMs can be configured reasonably effectively for procedure customization, and that multi-agent editing architectures may be worth exploring further for other customization applications (e.g., coding, creative writing).
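The sequential edit-then-verify architecture described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `customize_procedure`, the prompt wording, and the stub agents are all assumptions made for illustration, and an "LLM" is modeled as any function from a prompt string to a response string.

```python
from typing import Callable, List

# Assumption: an LLM agent is any callable mapping a prompt to a text response.
LLM = Callable[[str], str]


def customize_procedure(
    steps: List[str], need: str, edit_llm: LLM, verify_llm: LLM
) -> List[str]:
    """Run two agents in sequence: one edits the generic procedure,
    the other checks (and may repair) the edited plan's executability."""
    generic = "\n".join(steps)
    # Agent 1: edit the generic procedure to meet the customization need.
    edited = edit_llm(
        f"Customization need: {need}\n"
        "Edit the procedure below to meet this need. Return one step per line.\n"
        f"PROCEDURE:\n{generic}"
    )
    # Agent 2: verify that the edited plan can be executed in order.
    verified = verify_llm(
        "Check that the plan below can be executed in order and fix any step "
        "that cannot. Return one step per line.\n"
        f"PLAN:\n{edited}"
    )
    return [line.strip() for line in verified.splitlines() if line.strip()]


# Hypothetical stand-in agents so the sketch runs without a model;
# a real setup would call an actual LLM here.
def stub_editor(prompt: str) -> str:
    procedure = prompt.split("PROCEDURE:\n", 1)[1]
    return procedure.replace("Spray pesticide", "Hand-pick pests")


def stub_verifier(prompt: str) -> str:
    # Accepts the plan unchanged.
    return prompt.split("PLAN:\n", 1)[1]


plan = customize_procedure(
    ["Prepare the soil", "Plant the seeds", "Spray pesticide weekly"],
    "no pesticides",
    stub_editor,
    stub_verifier,
)
print(plan)
```

Passing the agents in as plain callables keeps the pipeline independent of any particular model or API, so the same structure could be compared against a single end-to-end prompt.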