Blueprint First, Model Second: A Framework for Deterministic LLM Workflow

Large language model (LLM) agents are powerful, but their inherent non-determinism limits their application in structured operational environments where procedural fidelity and predictable execution are strict requirements. This limitation stems from current architectures that conflate probabilistic, high-level planning with low-level action execution within a single generative process. To address this, we introduce the \textsc{Source Code Agent} framework, a new paradigm built on the ``Blueprint First, Model Second'' philosophy that decouples workflow logic from the generative model. An expert-defined operational procedure is first codified into a source code-based Execution Blueprint, which is then executed by a deterministic engine. The LLM is invoked as a specialized tool to handle bounded, complex sub-tasks within the workflow, but never to decide the workflow's path. We evaluate the framework on the TravelPlanner benchmark for constraint-aware travel planning. The \textsc{Source Code Agent} achieves a 35.56\% final pass rate, a 97.6\% relative improvement over the state-of-the-art ATLAS baseline (18.00\%) on the same Claude-Sonnet-4 backbone. Critically, it reduces constraint violations by 96.0\% (11 vs. 275) while reducing the number of execution steps by 27.1\% (10.2$\pm$0.7 steps vs. 14.0). Two production incident-diagnosis deployments and additional results on ScienceWorld and ALFWorld indicate that the architecture transfers beyond travel planning to procedurally well-defined, constraint-intensive workflows. Our work enables the verifiable and reliable deployment of autonomous agents in applications governed by strict procedural logic.
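To make the decoupling concrete, the following is a minimal Python sketch of the idea, not the framework's actual implementation. Every name in it (`LLMTool`, `Trace`, `pick_city`, `travel_blueprint`, `stub_llm`) and the toy price table are hypothetical and introduced only for illustration. The step order and branching live in ordinary source code, and the model is called only for one bounded sub-task whose output is validated before use.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type: any function that maps a prompt string to a reply string.
LLMTool = Callable[[str], str]


@dataclass
class Trace:
    """Records each step the engine takes so a run can be audited afterwards."""
    steps: List[str] = field(default_factory=list)

    def log(self, name: str) -> None:
        self.steps.append(name)


def pick_city(llm: LLMTool, request: str, allowed: List[str]) -> str:
    """Bounded sub-task: the model proposes, ordinary code validates."""
    reply = llm(f"Choose one city from {allowed} for: {request}").strip()
    # The reply is treated as untrusted data and never steers control flow.
    return reply if reply in allowed else allowed[0]


def travel_blueprint(llm: LLMTool, request: str, budget: int,
                     prices: Dict[str, int], trace: Trace) -> Dict[str, object]:
    """The blueprint: which step runs next is decided here, in source code."""
    trace.log("filter_by_budget")
    affordable = sorted(c for c, p in prices.items() if p <= budget)
    if not affordable:
        trace.log("reject")
        return {"status": "infeasible"}

    trace.log("pick_city")
    city = pick_city(llm, request, affordable)

    trace.log("verify_constraints")
    if prices[city] > budget:  # cannot happen given the validation above
        raise ValueError("budget constraint violated")

    trace.log("emit_plan")
    return {"status": "ok", "city": city, "cost": prices[city]}


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption: always answers 'Lisbon')."""
    return "Lisbon"


if __name__ == "__main__":
    trace = Trace()
    toy_prices = {"Lisbon": 400, "Oslo": 900, "Paris": 700}
    print(travel_blueprint(stub_llm, "a quiet weekend", 750, toy_prices, trace))
    print(trace.steps)
```

In this sketch, a model reply outside the allowed set falls back to a deterministic default, so the sequence of executed steps is the same no matter what the model returns. That property is what the abstract means by the LLM never deciding the workflow's path.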
