PARD: Enhancing Goodput for Inference Pipeline via Proactive Request Dropping

Modern deep neural network (DNN) applications integrate multiple DNN models into inference pipelines with stringent latency requirements for customized tasks. To mitigate the widespread request timeouts caused by request accumulation, systems for inference pipelines commonly drop a subset of requests so that the remaining ones can satisfy latency constraints. Because request dropping is commonly believed to hurt goodput, existing systems drop requests only when they have to, a policy we call reactive dropping. However, this reactive policy cannot maintain high goodput: it neither makes timely dropping decisions nor identifies the proper set of requests to drop, so it drops requests too late or drops the wrong ones. We propose that the inference system should instead proactively drop certain requests in advance to enhance goodput across the entire workload. To achieve this, we design PARD, an inference system that enhances goodput with timely and precise dropping decisions. PARD integrates a proactive dropping method that decides when to drop requests using runtime information of the inference pipeline, and an adaptive request priority mechanism that selects which specific requests to drop based on remaining latency budgets and workload intensity. Evaluation on a cluster of 64 GPUs over real-world workloads shows that PARD achieves $16\%$-$176\%$ higher goodput than the state of the art while reducing the drop rate and wasted computation resources by $1.6\times$-$17\times$ and $1.5\times$-$62\times$, respectively.
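
The contrast between reactive and proactive dropping can be illustrated with a minimal sketch. This is not PARD's algorithm, which the abstract does not specify. The `Request` class, the function names, and the single `est_downstream_ms` estimate (standing in for "runtime information of the inference pipeline") are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative assumptions."""
    req_id: int
    arrival_ms: float  # time the request entered the pipeline
    slo_ms: float      # end-to-end latency budget


def remaining_budget(req: Request, now_ms: float) -> float:
    """Latency budget left for the request at time now_ms."""
    return req.slo_ms - (now_ms - req.arrival_ms)


def reactive_should_drop(req: Request, now_ms: float) -> bool:
    """Reactive policy: drop only once the deadline has already passed."""
    return remaining_budget(req, now_ms) <= 0


def proactive_should_drop(req: Request, now_ms: float,
                          est_downstream_ms: float) -> bool:
    """Proactive policy (assumed form): drop as soon as the estimated time
    still needed by the remaining pipeline stages exceeds the budget left."""
    return remaining_budget(req, now_ms) < est_downstream_ms


req = Request(req_id=1, arrival_ms=0.0, slo_ms=100.0)
# At t = 70 ms the request has 30 ms left but an estimated 40 ms of work ahead.
print(reactive_should_drop(req, now_ms=70.0))                           # False
print(proactive_should_drop(req, now_ms=70.0, est_downstream_ms=40.0))  # True
```

In this toy setting the reactive check keeps the request until it has consumed GPU time in later stages and still times out, whereas the proactive check discards it before that computation is spent.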
