Towards Deterministic End-to-end Latency for Medical AI Systems in NVIDIA Holoscan

The introduction of AI and ML technologies into medical devices has revolutionized healthcare diagnostics and treatments. Medical device manufacturers are keen to maximize the advantages afforded by AI and ML by consolidating multiple applications onto a single platform. However, concurrent execution of several AI applications, each with its own visualization components, leads to unpredictable end-to-end latency, primarily due to GPU resource contention. To mitigate this, manufacturers typically deploy separate workstations for distinct AI applications, thereby increasing financial, energy, and maintenance costs. This paper addresses these challenges within the context of NVIDIA's Holoscan platform, a real-time AI system for streaming sensor data and images. We propose a system design optimized for heterogeneous GPU workloads, encompassing both compute and graphics tasks. Our design leverages CUDA MPS for spatial partitioning of compute workloads and isolates compute and graphics processing onto separate GPUs. Through empirical evaluation with real-world Holoscan medical device applications, we demonstrate significant improvements across various end-to-end latency determinism metrics. For instance, compared to a single-GPU baseline, the proposed design reduces maximum latency by 21-30% and improves latency distribution flatness by 17-25% for up to five concurrent endoscopy tool tracking AI applications. Against a default multi-GPU setup, our optimizations decrease maximum latency by 35% for up to six concurrent applications by improving GPU utilization by 42%. This paper provides clear design insights for AI applications in the edge-computing domain, including medical systems, where performance predictability of concurrent and heterogeneous GPU workloads is a critical requirement.
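As a rough illustration of the spatial partitioning idea, the sketch below prepares per-application environments for CUDA MPS clients. It relies on two standard CUDA environment variables, `CUDA_VISIBLE_DEVICES` and `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE`. Everything else is an assumption made for illustration: the helper name `mps_client_envs`, the equal split of SM resources across applications, and the choice of GPU index "0" as the compute GPU. The abstract does not state the partition sizes the authors actually use, and this sketch does not cover how the graphics workload is routed to the second GPU.

```python
import os


def mps_client_envs(num_apps, compute_gpu="0", total_percent=100):
    """Build one environment dict per application (hypothetical helper).

    Each environment pins the CUDA work of one application to the compute
    GPU and caps its share of SM threads. The equal split is an assumption,
    not the partitioning reported in the paper.
    """
    if num_apps < 1:
        raise ValueError("num_apps must be at least 1")
    share = total_percent // num_apps  # equal SM share per client (assumed)
    envs = []
    for _ in range(num_apps):
        env = dict(os.environ)
        # Restrict this process's CUDA work to the dedicated compute GPU.
        env["CUDA_VISIBLE_DEVICES"] = compute_gpu
        # Limit the fraction of SM threads this MPS client may occupy.
        env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(share)
        envs.append(env)
    return envs


if __name__ == "__main__":
    for i, env in enumerate(mps_client_envs(5)):
        print(i, env["CUDA_VISIBLE_DEVICES"],
              env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"])
```

Each dict could be passed as the `env` argument of `subprocess.Popen` when launching an application. The limits only take effect if the MPS control daemon (`nvidia-cuda-mps-control -d`) is already running on the compute GPU.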
