Text-to-image models such as Stable Diffusion and DALL-E 3 still struggle with multi-turn image editing. We decompose such a task into an agentic workflow (path) of tool use that addresses a sequence of subtasks with AI tools of varying costs. Conventional search algorithms require expensive exploration to find tool paths. Large language models (LLMs) possess prior knowledge of subtask planning, but they may lack accurate estimates of the capabilities and costs of tools, which are needed to decide which tool to apply to each subtask. Can we combine the strengths of LLMs and graph search to find cost-efficient tool paths? We propose a three-stage approach, "CoSTA*", that leverages LLMs to create a subtask tree, uses that tree to prune a graph of AI tools for the given task, and then conducts A* search on the resulting small subgraph to find a tool path. To better balance total cost and quality, CoSTA* combines both metrics of each tool on every subtask to guide the A* search. Each subtask's output is then evaluated by a vision-language model (VLM), and a failure triggers an update of the tool's cost and quality on that subtask. Hence, the A* search can recover from failures quickly and explore other paths. Moreover, CoSTA* can automatically switch between modalities across subtasks for a better cost-quality trade-off. We build a novel benchmark of challenging multi-turn image editing, on which CoSTA* outperforms state-of-the-art image-editing models and agents in terms of both cost and quality, and supports versatile trade-offs based on user preference.
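To make the search-and-recover loop concrete, the following is a minimal sketch and not the CoSTA* implementation. It assumes the pruned subgraph reduces to an ordered list of subtasks, each with candidate tools annotated by a (cost, quality) pair. The linear blend in `edge_weight`, the `penalty` update rule, the `check` callback standing in for the VLM, and all tool names are hypothetical choices made for illustration.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# Each stage maps a candidate tool name to (cost, quality in [0, 1]).
Stage = Dict[str, Tuple[float, float]]


def edge_weight(cost: float, quality: float, alpha: float) -> float:
    """Hypothetical blend: alpha weights cost, (1 - alpha) weights quality loss."""
    return alpha * cost + (1.0 - alpha) * (1.0 - quality)


def a_star_tool_path(stages: List[Stage], alpha: float = 0.5) -> List[str]:
    """A* over a layered tool graph: pick one tool per subtask, in order."""
    n = len(stages)
    # Admissible heuristic: sum of the cheapest weights of the remaining subtasks.
    best = [min(edge_weight(c, q, alpha) for c, q in s.values()) for s in stages]
    h = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        h[i] = h[i + 1] + best[i]

    # Frontier entries: (f = g + h, g, index of next subtask, path so far).
    frontier = [(h[0], 0.0, 0, ())]
    while frontier:
        _, g, i, path = heapq.heappop(frontier)
        if i == n:
            return list(path)
        for tool, (cost, quality) in stages[i].items():
            g2 = g + edge_weight(cost, quality, alpha)
            heapq.heappush(frontier, (g2 + h[i + 1], g2, i + 1, path + (tool,)))
    raise ValueError("no tool path found")


def run_with_feedback(
    stages: List[Stage],
    check: Callable[[int, str], bool],
    alpha: float = 0.5,
    penalty: float = 2.0,
    max_retries: int = 5,
) -> List[str]:
    """Re-plan whenever check(subtask_index, tool), a stand-in for the VLM, fails."""
    stages = [dict(s) for s in stages]  # copy so the caller's data is not mutated
    for _ in range(max_retries):
        path = a_star_tool_path(stages, alpha)
        for i, tool in enumerate(path):
            if not check(i, tool):
                # Assumed update rule: raise the cost and lower the quality estimate.
                cost, quality = stages[i][tool]
                stages[i][tool] = (cost * penalty, quality / penalty)
                break
        else:
            return path  # every subtask passed the check
    raise RuntimeError("no passing tool path within the retry budget")


# Hypothetical tools and numbers, for illustration only.
stages = [
    {"cheap_editor": (1.0, 0.6), "diffusion_editor": (5.0, 0.9)},
    {"ocr": (0.5, 0.9)},
]
cost_focused = a_star_tool_path(stages, alpha=0.5)
quality_focused = a_star_tool_path(stages, alpha=0.0)
recovered = run_with_feedback(stages, lambda i, tool: tool != "cheap_editor")
```

In this sketch, `alpha` plays the role of the user preference: values near 1 favor cheap tools and values near 0 favor high-quality ones. Costs are left unnormalized here, so a real system would likely need to rescale them before blending them with quality.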