State-of-the-art sequential reasoning in Large Language Models (LLMs) has expanded the capabilities of Copilots beyond conversational tasks to complex function calling, managing thousands of API calls. However, compositional prompting tends to segment a task into multiple steps, each requiring a round trip to the GPT APIs, which increases system latency and cost. Although recent advances in parallel function calling have increased the number of tools executed per API call, they may require more detailed in-context instructions and task breakdown at the prompt level, resulting in higher engineering and production costs. Inspired by the hardware design principle of multiply-add (MAD) operations, in which multiple arithmetic operations are fused into a single task from the compiler's perspective, we propose LLM-Tool Compiler, which selectively fuses similar types of tool operations under a single function at runtime and presents them to the LLM as a unified task. This selective fusion inherently enhances parallelization and efficiency. Benchmarked on a large-scale Copilot platform, LLM-Tool Compiler achieves up to four times more parallel calls than existing methods and reduces token costs and latency by up to 40% and 12%, respectively.
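The fusion idea can be illustrated with a minimal sketch. It is not the authors' implementation: the tool registry `TOOLS`, the category labels, the `fuse_tools` and `dispatch` helpers, and the `fused_<category>` naming are all hypothetical, chosen only to show how tools of a similar type might be exposed to the model as one function and then unpacked at runtime.

```python
from collections import defaultdict
from typing import Any, Callable

# Hypothetical registry (an assumption): tool name -> (category, implementation).
TOOLS: dict[str, tuple[str, Callable[..., Any]]] = {
    "get_weather": ("lookup", lambda city: f"weather({city})"),
    "get_time": ("lookup", lambda city: f"time({city})"),
    "send_email": ("action", lambda to, body: f"email({to})"),
}


def fuse_tools(tools, min_group_size=2):
    """Group tools by category and expose each large enough group as one fused function.

    Returns a mapping from the function name shown to the model to the
    underlying tool names it covers. Singleton groups are left unfused.
    """
    groups = defaultdict(list)
    for name, (category, _impl) in tools.items():
        groups[category].append(name)

    fused = {}
    for category, names in groups.items():
        if len(names) >= min_group_size:
            fused[f"fused_{category}"] = sorted(names)
        else:
            for name in names:
                fused[name] = [name]
    return fused


def dispatch(fused, tools, function_name, calls):
    """Run every sub-call the model packed into a single (possibly fused) function call."""
    allowed = fused[function_name]
    results = []
    for call in calls:
        if call["tool"] not in allowed:
            raise ValueError(f"{call['tool']} is not part of {function_name}")
        _category, impl = tools[call["tool"]]
        results.append(impl(**call["args"]))
    return results


fused = fuse_tools(TOOLS)
# One model response carrying two lookups instead of two separate round trips.
results = dispatch(
    fused,
    TOOLS,
    "fused_lookup",
    [
        {"tool": "get_weather", "args": {"city": "Paris"}},
        {"tool": "get_time", "args": {"city": "Paris"}},
    ],
)
```

In this sketch the model sees two functions (`fused_lookup` and `send_email`) rather than three, and a single call to `fused_lookup` can carry several lookups at once. Grouping by a fixed category label is a simplifying assumption; the abstract says only that similar types of tool operations are fused selectively.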