Large language model (LLM) applications are increasingly executed as heterogeneous multi-stage workflows rather than isolated inference calls. In these workflow directed acyclic graphs (DAGs), scheduling decisions affect not only the currently ready stage but also the execution state inherited by downstream stages, including model residency, parent-output locality, prefix reuse, and future device reachability. Existing serving and DAG-scheduling policies mainly optimize immediate queue state, placement cost, or reuse signals in isolation, which can fragment useful state and increase end-to-end latency. We present FATE, a future-state-aware scheduler for heterogeneous LLM workflows. FATE combines a CP-SAT-backed frontier planner, horizon-aware candidate scoring, bounded multi-device shard execution, and state-conditional cost estimation. Rather than solving a monolithic full-DAG problem, FATE repeatedly plans over the current ready frontier and scores each assignment by both its immediate cost and the downstream state it induces. Across real-DAG and controlled prefix-reuse benchmarks, FATE outperforms practical heuristics, classical DAG scheduling, and proxy adaptations of recent workflow-serving policies. On the real-DAG benchmark, it achieves a normalized makespan of 0.675 and a normalized P95 latency of 0.677, which are 32.5% and 32.3% lower than RoundRobin and 8.9% and 8.8% lower than the strongest non-FATE baseline, respectively. Mechanism analysis and ablations show that these gains arise from jointly preserving multiple dimensions of future execution state rather than from prefix reuse alone. These results indicate that future-state preservation should be treated as a first-class scheduling objective for heterogeneous LLM workflow serving.
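To make the idea of frontier planning with future-state-aware scoring concrete, the following is a minimal sketch and not FATE's implementation. It scores each stage-to-device assignment as immediate cost plus a weighted penalty for evicting a model that upcoming stages still need. All names (`Stage`, `Device`, `plan_frontier`, `horizon_weight`) and the constants `LOAD_COST` and `TRANSFER_COST` are hypothetical, the cost terms are illustrative, and exhaustive enumeration stands in for the CP-SAT planner described above. The sketch also assumes the frontier is no larger than the device pool and covers only model residency and parent-output locality, not prefix reuse or device reachability.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Dict, List, Optional

LOAD_COST = 2.0      # assumed cost of loading a non-resident model
TRANSFER_COST = 1.0  # assumed cost of moving a parent's output across devices


@dataclass(frozen=True)
class Stage:
    name: str
    model: str
    parent_device: Optional[str]  # device holding the parent's output, if any
    work: float


@dataclass(frozen=True)
class Device:
    name: str
    resident_model: Optional[str]
    speed: float


def immediate_cost(stage: Stage, device: Device) -> float:
    """Compute time plus model-load and parent-output transfer costs."""
    cost = stage.work / device.speed
    if device.resident_model != stage.model:
        cost += LOAD_COST
    if stage.parent_device not in (None, device.name):
        cost += TRANSFER_COST
    return cost


def future_penalty(stage: Stage, device: Device, upcoming: List[str]) -> float:
    """Penalize evicting a resident model that upcoming stages still need."""
    evicted = device.resident_model
    if evicted is None or evicted == stage.model:
        return 0.0
    return LOAD_COST * upcoming.count(evicted)


def plan_frontier(
    frontier: List[Stage],
    devices: List[Device],
    upcoming: List[str],
    horizon_weight: float,
) -> Dict[str, str]:
    """Pick the one-stage-per-device assignment with the lowest total score.

    Brute-force enumeration; a constraint solver would replace this loop.
    """
    best_score = float("inf")
    best: Dict[str, str] = {}
    for chosen in permutations(devices, len(frontier)):
        score = sum(
            immediate_cost(s, d) + horizon_weight * future_penalty(s, d, upcoming)
            for s, d in zip(frontier, chosen)
        )
        if score < best_score:
            best_score = score
            best = {s.name: d.name for s, d in zip(frontier, chosen)}
    return best


devices = [Device("gpu0", "llama", 1.0), Device("gpu1", "qwen", 2.0)]
frontier = [Stage("summarize", "mistral", None, 1.0)]
greedy = plan_frontier(frontier, devices, ["qwen"], horizon_weight=0.0)
aware = plan_frontier(frontier, devices, ["qwen"], horizon_weight=1.0)
print(greedy, aware)
```

In this toy instance, a horizon weight of zero places the stage on the faster `gpu1` and evicts the model a downstream stage needs, while a weight of one keeps that model resident by choosing `gpu0` despite its higher immediate cost.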