Hybrid Edge-HPC Systems for Low-Latency Data-Driven Inference

Liubov Kurafeeva, Ryan Hartung, Benjamin Carter, Alan Subedi, Avhishek Biswas, Michael Fay, Shantenu Jha, Chandra Krintz, Andre Merzky, Douglas Thain, Memet Can Vuran, Rich Wolski

Emerging cyber-physical systems increasingly require low-latency inference from streaming sensor data while maintaining models that reflect complex and evolving physical processes. In many domains, however, model updates depend on high-fidelity simulations and training executed on remote high-performance computing (HPC) systems under batch scheduling. This creates a fundamental mismatch between the responsiveness required at the edge and the cost, throughput, and availability of simulation-driven model updates. We present RBF (Reverse Backfill), a hybrid edge-HPC learning and inference architecture that integrates low-latency edge inference with asynchronous, simulation-driven model improvement. RBF targets simulation-bounded settings in which model updates are constrained by simulation throughput and HPC scheduling delays, and reinterprets HPC backfilling by using opportunistic computation to improve model accuracy rather than system utilization. RBF decouples inference from simulation and training by deploying lightweight surrogate models at the edge while incorporating improved models asynchronously as they become available. The architecture supports pluggable surrogate models and orchestrates computation across heterogeneous infrastructure spanning edge devices, private 5G, cloud, and HPC resources. We instantiate RBF using a real-world digital agriculture deployment that couples edge sensing with computational fluid dynamics (CFD) simulations to infer airflow patterns in a large agricultural screenhouse. Our evaluation characterizes end-to-end system behavior under realistic constraints, quantifying simulation latency, training cost, inference throughput, and the impact of delayed model updates on prediction accuracy. Results demonstrate that RBF enables continuous, low-latency inference while improving model fidelity over time despite delayed and irregular model updates.
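
The decoupling described above, where edge inference keeps running on whatever surrogate is currently deployed while improved models arrive asynchronously, can be illustrated with a minimal sketch. This is not the RBF implementation: the names `ModelSlot`, `ModelVersion`, `infer`, `coarse`, and `refined` are hypothetical, the surrogates are toy functions, and the version-number check is an assumed policy for handling delayed or out-of-order updates.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# A surrogate maps one sensor reading (a vector) to a prediction.
Surrogate = Callable[[Sequence[float]], float]


@dataclass(frozen=True)
class ModelVersion:
    version: int        # assumed to be a monotonically increasing counter
    predict: Surrogate


class ModelSlot:
    """Holds the surrogate used for inference; swaps happen under a lock."""

    def __init__(self, initial: ModelVersion) -> None:
        self._lock = threading.Lock()
        self._current = initial

    def get(self) -> ModelVersion:
        with self._lock:
            return self._current

    def publish(self, candidate: ModelVersion) -> bool:
        """Install the candidate only if it is newer than the deployed model."""
        with self._lock:
            if candidate.version <= self._current.version:
                return False  # stale or duplicate update, ignore it
            self._current = candidate
            return True


def infer(slot: ModelSlot, reading: Sequence[float]) -> Tuple[int, float]:
    """Run inference with the currently deployed model; never waits on training."""
    model = slot.get()
    return model.version, model.predict(reading)


# Toy surrogates standing in for models trained from simulation output.
def coarse(reading: Sequence[float]) -> float:
    return sum(reading) / len(reading)


def refined(reading: Sequence[float]) -> float:
    return 0.9 * sum(reading) / len(reading) + 0.1 * max(reading)


slot = ModelSlot(ModelVersion(1, coarse))
before = infer(slot, [2.0, 4.0])                    # served by version 1

# Later, an improved model arrives (for example, after a remote job finishes).
accepted = slot.publish(ModelVersion(2, refined))
stale = slot.publish(ModelVersion(1, coarse))       # late, older update

after = infer(slot, [2.0, 4.0])                     # served by version 2
```

In this sketch the inference path only reads the slot, so a slow or irregular update stream affects which model version answers a query but not whether the query is answered. A real deployment would also need model transfer, validation before swap, and rollback, none of which are shown here.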
