Design Methodology and Performance Trade-offs Management for Distributed and Compound AI Systems

Artificial Intelligence (AI) systems must typically satisfy service-level objectives (SLOs) on accuracy, latency, and cost. Prevailing model-centric approaches select a single monolithic model at design time. Such a model applies identical computation regardless of input difficulty, cannot decompose tasks across specialized components, and has knowledge that is fixed at training time. At runtime, these limitations can lead to performance degradation and rising costs. Because the model is the main design variable, it determines most of the system's behavior, coupling operational objectives to a single design-time choice. Addressing these limitations requires shifting from model-centric to system-centric design. Compound AI systems realize this shift by orchestrating multiple models, algorithms, and tools as distributed AI systems through explicit control logic. The performance of such systems depends on their workflow topology, the models assigned to each task, and the parameters governing runtime behavior. We present a design methodology that organizes this design space along two dimensions, workflow topology and configuration selection, and identifies eight design patterns, each consolidating techniques that address a specific limitation of monolithic deployment. We validate our methodology through three case studies, in which Compound AI configurations come within 2.5 to 4 percentage points of the accuracy of monolithic models while reducing latency by up to 60% and cost by up to 71%. We show that model selection and parameter configuration jointly determine system performance, but the resulting design space grows combinatorially as workflows compose more patterns and components. We therefore identify five open challenges that define a roadmap from manually configured prototypes towards systems that automatically discover and maintain SLO-compliant configurations in Compound and Distributed AI systems.
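As a rough illustration of why the configuration space grows combinatorially, the following minimal sketch counts the joint configurations of a hypothetical three-task workflow in which each task chooses one model and one value for a single runtime parameter. The task names, model names, and parameter grids are illustrative assumptions, not configurations from the case studies; the point is only that the count is the product of the option counts of every knob, so each added task multiplies the space.

```python
from itertools import product
from math import prod

# Hypothetical workflow (illustrative assumption): each task picks one model
# and one value for each of its runtime parameters.
workflow = {
    "retrieve": {"model": ["small-embed", "large-embed"], "top_k": [3, 5, 10]},
    "generate": {"model": ["small-llm", "medium-llm", "large-llm"], "temperature": [0.0, 0.7]},
    "verify": {"model": ["small-llm", "large-llm"], "max_retries": [0, 1, 2]},
}


def design_space_size(workflow):
    """Number of joint configurations: product of option counts over all knobs of all tasks."""
    return prod(len(options) for knobs in workflow.values() for options in knobs.values())


def enumerate_configs(workflow):
    """Yield every joint configuration as a dict keyed by (task, knob)."""
    keys = [(task, knob) for task, knobs in workflow.items() for knob in knobs]
    for values in product(*(workflow[task][knob] for task, knob in keys)):
        yield dict(zip(keys, values))


if __name__ == "__main__":
    print(design_space_size(workflow))  # 2*3 * 3*2 * 2*3 = 216
    # Composing one more hypothetical task multiplies the space by its option count (2*3 = 6).
    extended = dict(workflow, rerank={"model": ["small-llm", "large-llm"], "top_n": [1, 3, 5]})
    print(design_space_size(extended))  # 1296
```

Exhaustive enumeration is only feasible for toy spaces like this one, which is why automatically discovering SLO-compliant configurations becomes a search problem as workflows grow.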
