The rapid adoption of large language models and multimodal foundation models has made multimodal data preparation pipelines critical AI infrastructure. These pipelines interleave CPU-heavy preprocessing with accelerator-backed (GPU/NPU/TPU) inference and produce massive intermediate artifacts. Achieving high throughput is difficult because workloads are highly non-stationary: regime shifts, input-dependent inference costs, and transient memory spikes cause rapid performance fluctuations and out-of-memory (OOM) failures. Existing schedulers typically rely on threshold-based autoscaling or assume synchronous, homogeneous operators, leading to poor efficiency. We present Trident, an adaptive scheduling framework for heterogeneous multimodal pipelines on fixed-resource clusters. Trident closes the loop across three coupled layers: (i) an observation layer that estimates per-operator sustainable throughput for asynchronous operators via Gaussian Process regression with anomaly filtering; (ii) an adaptation layer that detects workload shifts online and performs memory-constrained Bayesian optimization to recommend OOM-safe configurations; and (iii) a scheduling layer that solves a mixed-integer linear program to jointly optimize operator parallelism, placement, and configuration transitions under heterogeneous compute and bandwidth constraints, accounting for cold-start overhead via rolling updates. Scheduling decisions trigger sample invalidation and model refresh to keep estimates consistent with the active configuration. Implemented on Ray Data, Trident improves end-to-end throughput by up to 2.01x on a document curation (PDF) pipeline and 1.88x on a video curation pipeline over a static baseline, with overhead low enough for online re-optimization.
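To make the observation layer concrete, the sketch below illustrates the general technique the abstract names: fitting a Gaussian Process to per-operator throughput samples as a function of parallelism, after filtering anomalous readings. This is a minimal illustration, not Trident's actual implementation; the operator data, the saturating throughput curve, and the median-absolute-deviation anomaly filter are all assumptions chosen for the example.

```python
# Minimal sketch (assumed, not Trident's code): estimate an operator's
# sustainable throughput vs. parallelism with GP regression, after a
# simple MAD-based anomaly filter drops transient stall readings.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Synthetic (parallelism, throughput) samples for one operator:
# a saturating curve plus measurement noise, with injected anomalies
# standing in for transient stalls.
parallelism = np.repeat(np.arange(1, 9), 5).astype(float)
throughput = 10.0 * parallelism / (1.0 + 0.08 * parallelism)
throughput += rng.normal(0.0, 0.5, size=throughput.shape)
throughput[[3, 17, 30]] *= 0.2  # inject three anomalous readings

# Anomaly filter: within each parallelism level, drop samples whose
# residual from the median exceeds 3 median-absolute-deviations.
keep = np.ones_like(throughput, dtype=bool)
for p in np.unique(parallelism):
    idx = parallelism == p
    med = np.median(throughput[idx])
    mad = np.median(np.abs(throughput[idx] - med)) + 1e-9
    keep[idx] &= np.abs(throughput[idx] - med) / mad < 3.0

X = parallelism[keep].reshape(-1, 1)
y = throughput[keep]

# GP with an RBF kernel plus a white-noise term for measurement noise.
gp = GaussianProcessRegressor(
    kernel=RBF(length_scale=2.0) + WhiteKernel(noise_level=0.25),
    normalize_y=True,
)
gp.fit(X, y)

# Predict sustainable throughput (with uncertainty) at a given
# parallelism; the posterior std gives the scheduler a confidence band.
mean, std = gp.predict(np.array([[8.0]]), return_std=True)
print(f"predicted throughput at parallelism 8: {mean[0]:.1f} (std {std[0]:.2f})")
```

The posterior standard deviation is what makes a GP a natural fit here: downstream layers (e.g. the memory-constrained Bayesian optimizer) can treat low-confidence regions of the throughput model differently from well-sampled ones.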