CoSTA*: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing

Text-to-image models such as Stable Diffusion and DALL-E 3 still struggle with multi-turn image editing. We decompose such a task into an agentic workflow (path) of tool use that addresses a sequence of subtasks with AI tools of varying costs. Conventional search algorithms require expensive exploration to find tool paths. Large language models (LLMs) possess prior knowledge for subtask planning, but they may lack accurate estimates of the tools' capabilities and costs, which are needed to decide which tool to apply to each subtask. Can we combine the strengths of LLMs and graph search to find cost-efficient tool paths? We propose a three-stage approach, "CoSTA*", that uses an LLM to create a subtask tree, uses that tree to prune a graph of AI tools for the given task, and then conducts A* search on the resulting small subgraph to find a tool path. To better balance total cost and quality, CoSTA* combines both metrics of each tool on every subtask to guide the A* search. Each subtask's output is then evaluated by a vision-language model (VLM), and a failure triggers an update of the tool's cost and quality on that subtask. Hence, the A* search can recover from failures quickly and explore other paths. Moreover, CoSTA* can automatically switch between modalities across subtasks for a better cost-quality trade-off. We build a novel benchmark of challenging multi-turn image editing, on which CoSTA* outperforms state-of-the-art image-editing models and agents in terms of both cost and quality, and supports flexible trade-offs according to user preference.
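
The search-and-recover loop described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the tool names, the cost and quality numbers, the linear blend `alpha * cost + (1 - alpha) * (1 - quality)`, the fixed failure penalty, and the `check` callback that stands in for the VLM evaluator. The pruned tool subgraph is also simplified to an ordered list of subtasks with candidate tools per subtask, and a failure triggers a full replan rather than a resume from the failed node.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    cost: float     # assumed normalized to [0, 1], lower is better
    quality: float  # assumed normalized to [0, 1], higher is better


def edge_weight(tool, alpha):
    """Hypothetical blend: alpha weights cost, (1 - alpha) weights quality loss."""
    return alpha * tool.cost + (1.0 - alpha) * (1.0 - tool.quality)


def a_star_toolpath(subtasks, tools, alpha=0.5):
    """Choose one tool per subtask, in order, minimizing the summed edge weight."""
    # Admissible heuristic: the cheapest possible weight of every remaining subtask.
    best = [min((edge_weight(t, alpha) for t in tools[s]), default=float("inf"))
            for s in subtasks]
    remaining = [0.0] * (len(subtasks) + 1)
    for i in range(len(subtasks) - 1, -1, -1):
        remaining[i] = remaining[i + 1] + best[i]

    # Frontier entries: (f = g + h, g, index of next subtask, path so far).
    frontier = [(remaining[0], 0.0, 0, ())]
    while frontier:
        _, g, i, path = heapq.heappop(frontier)
        if i == len(subtasks):
            return list(path), g
        for tool in tools[subtasks[i]]:
            g2 = g + edge_weight(tool, alpha)
            heapq.heappush(frontier, (g2 + remaining[i + 1], g2, i + 1, path + (tool.name,)))
    return None, float("inf")


def run_with_feedback(subtasks, tools, check, alpha=0.5, penalty=1.0, max_retries=5):
    """Replan after each failure; `check(subtask, tool_name)` stands in for the VLM."""
    tools = {s: list(ts) for s, ts in tools.items()}  # work on a local copy
    for _ in range(max_retries):
        path, _ = a_star_toolpath(subtasks, tools, alpha)
        if path is None:
            return None
        failed = next(((s, n) for s, n in zip(subtasks, path) if not check(s, n)), None)
        if failed is None:
            return path
        s, n = failed
        # Assumed update rule: raise the failed tool's cost and zero its quality.
        tools[s] = [Tool(t.name, t.cost + penalty, 0.0) if t.name == n else t
                    for t in tools[s]]
    return None


# Hypothetical subtasks and tools, for illustration only.
subtasks = ["detect_text", "replace_text"]
tools = {
    "detect_text": [Tool("ocr_fast", 0.1, 0.6), Tool("ocr_accurate", 0.5, 0.95)],
    "replace_text": [Tool("text_writer", 0.2, 0.9), Tool("diffusion_inpaint", 0.8, 0.7)],
}

balanced_path, balanced_total = a_star_toolpath(subtasks, tools, alpha=0.5)
quality_path, _ = a_star_toolpath(subtasks, tools, alpha=0.0)
recovered_path = run_with_feedback(subtasks, tools, lambda s, n: n != "ocr_fast")
```

With `alpha=0.5` the sketch prefers the cheap detector, with `alpha=0.0` it optimizes quality alone and picks the accurate one, and when the stand-in checker rejects the cheap detector the replanned path switches to the accurate one. Because each subtask's choice is independent in this simplified chain, A* here behaves like a per-subtask minimum; the search only becomes non-trivial on a graph where tools depend on one another's outputs.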

