The AI hardware boom has led modern data centers to adopt HPC-style architectures centered on distributed, GPU-centric computation. Large GPU clusters interconnected by fast RDMA networks and backed by high-bandwidth NVMe storage enable scalable computation and rapid access to storage-resident data. Tensor computation runtimes (TCRs), such as PyTorch, originally designed for AI workloads, have recently been shown to accelerate analytical workloads. However, prior work has primarily considered settings where the data fits in aggregated GPU memory. In this paper, we systematically study how TCRs can support scalable, distributed query processing for large-scale, storage-resident OLAP workloads. Although TCRs provide abstractions for network and storage I/O, naive use often underutilizes GPU and I/O bandwidth due to insufficient overlap between computation and data movement. As a core contribution, we present PystachIO, a prototype of a PyTorch-based distributed OLAP engine that combines fast network and storage I/O with key optimizations to maximize GPU, network, and storage utilization. Our evaluation shows up to 3x end-to-end speedups over existing distributed GPU-based query processing approaches.
翻译:AI硬件热潮推动现代数据中心采用以分布式、GPU为核心的高性能计算架构。由快速RDMA网络互联、高带宽NVMe存储支撑的大型GPU集群,实现了可扩展计算与存储驻留数据的快速访问。专为AI工作负载设计的张量计算运行时(如PyTorch),近期已被证明可加速分析型负载。然而,先前研究主要考虑数据适配于聚合GPU内存的场景。本文系统探究了张量计算运行时如何支持面向大规模存储驻留OLAP工作负载的可扩展分布式查询处理。尽管张量计算运行时提供了网络与存储I/O的抽象机制,但由于计算与数据移动之间缺乏充分交叠,直接使用往往导致GPU及I/O带宽利用率不足。作为核心贡献,我们提出PystachIO——一个基于PyTorch的分布式OLAP引擎原型,通过融合高速网络与存储I/O及关键优化技术,最大化GPU、网络与存储利用率。实验评估表明,与现有基于GPU的分布式查询处理方法相比,端到端性能提升可达3倍。