Dual-pronged deep learning preprocessing on heterogeneous platforms with CPU, Accelerator and CSD

For image-related deep learning tasks, the first step often involves reading data from external storage and preprocessing it on the CPU. As accelerators become faster and the number of accelerators per compute node grows, the gap in computing and data transfer capability between accelerators and CPUs widens, and data reading and preprocessing increasingly become the bottleneck of these tasks. Our work, DDLP, addresses the computing and data transfer bottleneck of deep learning preprocessing using Computable Storage Devices (CSDs). DDLP lets the CPU and the CSD preprocess the dataset in parallel, each starting from an opposite end. To this end, we propose two adaptive dynamic selection strategies that let DDLP direct the accelerator to read data automatically from different sources. The two strategies trade off consistency against efficiency. DDLP overlaps CSD data preprocessing with CPU preprocessing, accelerator computation, and accelerator data reading. In addition, DDLP leverages direct storage technology to enable efficient SSD-to-accelerator data transfer. Moreover, DDLP replaces part of the expensive CPU and DRAM usage with more energy-efficient CSDs, alleviating preprocessing bottlenecks while significantly reducing power consumption. Extensive experimental results show that DDLP improves learning speed by up to 23.5% on the ImageNet dataset while reducing energy consumption by 19.7% and CPU and DRAM usage by 37.6%. DDLP also improves learning speed by up to 27.6% on the CIFAR-10 dataset.
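
The idea of preprocessing a dataset from both ends can be illustrated with a minimal sketch. It is not DDLP's implementation: the names `DualEndedDispenser`, `run_worker`, and `dual_pronged_preprocess` are hypothetical, two Python threads stand in for the CPU and the CSD, and the adaptive selection strategies and direct storage transfers are not modeled. One worker claims sample indices from the head of the dataset and the other from the tail, and both stop when the two cursors meet, so every sample is preprocessed exactly once.

```python
import threading


class DualEndedDispenser:
    """Hands out sample indices from both ends of a dataset of size n (hypothetical helper)."""

    def __init__(self, n):
        self._head = 0        # next index claimed by the head-side (CPU) worker
        self._tail = n - 1    # next index claimed by the tail-side (CSD) worker
        self._lock = threading.Lock()

    def next_from_head(self):
        with self._lock:
            if self._head > self._tail:   # cursors have met: nothing left
                return None
            i = self._head
            self._head += 1
            return i

    def next_from_tail(self):
        with self._lock:
            if self._head > self._tail:
                return None
            i = self._tail
            self._tail -= 1
            return i


def run_worker(take, preprocess, out):
    """Claim indices with `take` until the dataset is exhausted."""
    while True:
        i = take()
        if i is None:
            return
        out.append((i, preprocess(i)))


def dual_pronged_preprocess(n, cpu_preprocess, csd_preprocess):
    """Run a head-side and a tail-side worker; return their (index, result) lists."""
    dispenser = DualEndedDispenser(n)
    cpu_out, csd_out = [], []
    workers = [
        threading.Thread(target=run_worker,
                         args=(dispenser.next_from_head, cpu_preprocess, cpu_out)),
        threading.Thread(target=run_worker,
                         args=(dispenser.next_from_tail, csd_preprocess, csd_out)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return cpu_out, csd_out
```

In this sketch the split point is not fixed in advance: the faster worker simply claims more samples, which is one simple way to balance load between two devices of unequal speed. Whether DDLP balances work in this way is an assumption of the sketch, not something the abstract states.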
