LLM-driven agentic applications increasingly automate complex, multi-step tasks, but serving them efficiently remains challenging due to heterogeneous components, dynamic and model-driven control flow, long-running state, and unpredictable latencies. Nalar is a ground-up agent-serving framework that cleanly separates workflow specification from execution while providing the runtime visibility and control needed for robust performance. Nalar preserves full Python expressiveness, using lightweight auto-generated stubs that turn agent and tool invocations into futures carrying dependency and context metadata. A managed state layer decouples logical state from physical placement, enabling safe reuse, migration, and consistent retry behavior. A two-level control architecture combines global policy computation with local event-driven enforcement to support adaptive routing, scheduling, and resource management across evolving workflows. Together, these mechanisms allow Nalar to deliver scalable, efficient, and policy-driven serving of heterogeneous agentic applications without burdening developers with orchestration logic. Across three agentic workloads, Nalar cuts tail latency by 34--74\%, achieves up to $2.9\times$ speedups, sustains 80 RPS where baselines fail, and scales to 130K futures with sub-500 ms control overhead.
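The stub-and-futures mechanism described above can be sketched as follows. This is a hypothetical illustration only: Nalar's actual API is not shown here, and the names `ToolStub` and `TracedFuture` are invented for exposition. The idea it demonstrates is that invoking a tool through a stub returns a future annotated with dependency and context metadata instead of blocking on the call itself.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class TracedFuture:
    """A future paired with the dependency and context metadata
    that a runtime could use for routing and scheduling decisions."""
    future: Future
    deps: list = field(default_factory=list)
    context: dict = field(default_factory=dict)

class ToolStub:
    """Wraps a plain Python callable so that calling it submits the
    work asynchronously and returns a metadata-carrying future."""
    def __init__(self, fn: Callable[..., Any], executor: ThreadPoolExecutor):
        self.fn = fn
        self.executor = executor

    def __call__(self, *args, deps=(), context=None) -> TracedFuture:
        fut = self.executor.submit(self.fn, *args)
        return TracedFuture(fut, list(deps), context or {})

executor = ThreadPoolExecutor(max_workers=4)
search = ToolStub(lambda q: f"results for {q}", executor)

# A downstream call declares its dependency on an upstream future,
# exposing the workflow's dataflow to the runtime.
f1 = search("LLM serving", context={"request_id": "r1"})
f2 = search("agent frameworks", deps=[f1], context={"request_id": "r1"})
print(f2.future.result())  # → results for agent frameworks
```

Because ordinary Python call syntax is preserved, a developer writes the workflow as usual while the stubs surface the dependency graph to the serving layer.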