The exponential growth of data-intensive machine learning workloads has exposed significant limitations in conventional GPU-accelerated systems, especially when processing datasets exceeding GPU DRAM capacity. We propose MQMS, an augmented in-storage GPU architecture and simulator that is aware of internal SSD states and operations, enabling intelligent scheduling and address allocation to overcome performance bottlenecks caused by CPU-mediated data access patterns. MQMS introduces dynamic address allocation to maximize internal parallelism and fine-grained address mapping to efficiently handle small I/O requests without incurring read-modify-write overheads. Through extensive evaluations on workloads ranging from large language model inference to classical machine learning algorithms, MQMS demonstrates orders-of-magnitude improvements in I/O request throughput, device response time, and simulation end time compared to existing simulators.