Distributed applications increasingly demand low end-to-end latency, especially in edge and cloud environments where co-located workloads contend for limited resources. Traditional load-balancing strategies are typically reactive and rely on outdated or coarse-grained metrics, often leading to suboptimal routing decisions and increased tail latencies. This paper investigates the use of round-trip time (RTT) predictors to enhance request routing by anticipating application latency. We develop lightweight, accurate RTT predictors trained on time-series monitoring data collected from a Kubernetes-managed GPU cluster. By leveraging a reduced set of monitoring metrics that correlate strongly with RTT, our approach maintains low overhead while remaining adaptable to diverse co-location scenarios and heterogeneous hardware. The predictors achieve up to 95% accuracy while keeping the prediction delay within 10% of the application RTT. In addition, we identify the minimum prediction accuracy threshold and the key system-level factors required to ensure effective predictor deployment in resource-constrained clusters. Simulation-based evaluation demonstrates that performance-aware load balancing can significantly reduce application RTT and minimize resource waste. These results highlight the feasibility of integrating predictive load balancing into future production systems.