Over the last decades, the computational power of GPUs has grown exponentially, allowing current deep learning (DL) applications to handle increasingly large amounts of data at progressively higher throughput. However, network and storage latencies cannot decrease at a similar pace due to physical constraints, leading to data stalls that bottleneck DL tasks. Additionally, managing vast quantities of data and their associated metadata has proven challenging, hampering the productivity of data scientists. Moreover, existing data loaders have limited network support: for maximum performance, data must be stored on local filesystems close to the GPUs, overloading the storage of the computing nodes. In this paper we propose a strategy, aimed at DL image applications, that addresses these challenges by: storing data and metadata in fast, scalable NoSQL databases; connecting the databases to state-of-the-art loaders for DL frameworks; and enabling high-throughput data loading over high-latency networks through our out-of-order, incremental prefetching techniques. To evaluate our approach, we present our implementation and assess its data loading capabilities through local, medium-latency, and high-latency (intercontinental) experiments.