Large language models struggle to maintain strict adherence to structured workflows in high-stakes domains such as healthcare triage. Monolithic approaches that encode entire decision structures within a single prompt are prone to instruction-following degradation as prompt length increases, including lost-in-the-middle effects and context window overflow. To address this gap, we present Arbor, a framework that decomposes decision tree navigation into specialized, node-level tasks. Decision trees are standardized into an edge-list representation and stored for dynamic retrieval. At runtime, a directed acyclic graph (DAG)-based orchestration mechanism iteratively retrieves only the outgoing edges of the current node, evaluates valid transitions via a dedicated LLM call, and delegates response generation to a separate inference step. The framework is agnostic to the underlying decision logic and model provider. We evaluate Arbor against single-prompt baselines across 10 foundation models using annotated turns from real clinical triage conversations. Arbor improves mean turn accuracy by 29.4 percentage points, reduces per-turn latency by 57.1%, and achieves an average 14.4x reduction in per-turn cost. These results indicate that architectural decomposition reduces dependence on intrinsic model capability, enabling smaller models to match or exceed larger models operating under single-prompt baselines.
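The node-level navigation loop described above can be sketched as follows. This is a minimal illustration, not Arbor's actual API: all names (`Edge`, `DecisionTree`, `choose_transition`) are hypothetical, and the dedicated LLM call that evaluates transitions is stubbed with simple substring matching.

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    src: str        # current node
    dst: str        # candidate next node
    condition: str  # natural-language transition criterion

@dataclass
class DecisionTree:
    edges: list[Edge] = field(default_factory=list)

    def outgoing(self, node: str) -> list[Edge]:
        # Retrieve only the outgoing edges of the current node,
        # so each evaluation step sees a small, node-local prompt
        # rather than the entire decision structure.
        return [e for e in self.edges if e.src == node]

def choose_transition(edges: list[Edge], utterance: str) -> str:
    # Stand-in for the dedicated LLM call that decides which
    # transition condition the patient's utterance satisfies.
    for e in edges:
        if e.condition in utterance:
            return e.dst
    return edges[0].src  # no condition matched: stay at current node

tree = DecisionTree(edges=[
    Edge("triage", "emergency", "chest pain"),
    Edge("triage", "self_care", "mild headache"),
])

node = "triage"
node = choose_transition(tree.outgoing(node), "patient reports chest pain")
```

In the full framework, response generation would then be delegated to a separate inference step conditioned on the selected node, keeping transition evaluation and response drafting in independent, short prompts.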