Training deep learning models is a repetitive and resource-intensive process. Data scientists often train several models before landing on the set of hyperparameters (e.g., via hyperparameter tuning), the model architecture (e.g., via neural architecture search), and other choices that yield the highest accuracy. The computational efficiency of these training tasks depends heavily on how well we can supply the training process with training data. The repetitive nature of these tasks causes the same data processing pipelines to run over and over, exacerbating the need for and cost of computational resources. In this paper, we present TensorSocket, which reduces the computational needs of deep learning training by enabling simultaneous training processes to share the same data loader. TensorSocket mitigates CPU-side bottlenecks in cases where the collocated training workloads achieve high throughput on the GPU but are held back by lower data-loading throughput on the CPU. It achieves this by eliminating redundant computations across collocated training processes and by leveraging modern GPU-GPU interconnects. We demonstrate the hardware- and pipeline-agnostic nature of TensorSocket and evaluate it on a variety of training scenarios. Our evaluation shows that TensorSocket enables scenarios that are infeasible without data sharing, increases training throughput by up to $100\%$, and, on cloud instances, achieves cost savings of $50\%$ by reducing CPU-side hardware resource needs. Furthermore, TensorSocket outperforms state-of-the-art solutions for shared data loading such as CoorDL and Joader; it is easier to use, maintain, and deploy, and matches or exceeds the throughput of these solutions while requiring fewer CPU resources.
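The core idea of sharing a single data loader across collocated training processes can be illustrated with a minimal sketch. This is not the TensorSocket API: all names here are hypothetical, and threads with in-process queues stand in for the separate processes, shared buffers, and GPU-GPU interconnects the paper describes. The sketch shows only the essential pattern: the CPU-side pipeline prepares each batch once, and every collocated trainer consumes that same prepared batch, instead of each trainer running a redundant copy of the pipeline.

```python
# Conceptual sketch (hypothetical names, NOT the TensorSocket API): a single
# producer runs the CPU-side data pipeline once per batch and fans each
# prepared batch out to several collocated "trainers". Preprocessing cost is
# paid once rather than once per trainer. TensorSocket itself shares batches
# across separate processes and GPUs; threads keep this sketch self-contained.
import queue
import threading

def producer(queues, num_batches):
    """Prepare each batch once and hand the same batch to every trainer."""
    for step in range(num_batches):
        # Stand-in for an expensive pipeline step (decode, augment, collate).
        batch = [x * x for x in range(step, step + 4)]
        for q in queues:
            q.put(batch)  # same prepared batch, shared with all trainers
    for q in queues:
        q.put(None)  # sentinel: end of the epoch

def trainer(q, results, idx):
    """Consume shared batches; summing stands in for a training step."""
    seen = []
    while True:
        batch = q.get()
        if batch is None:
            break
        seen.append(sum(batch))
    results[idx] = seen

def run(num_trainers=2, num_batches=3):
    queues = [queue.Queue() for _ in range(num_trainers)]
    results = [None] * num_trainers
    threads = [
        threading.Thread(target=trainer, args=(q, results, i))
        for i, q in enumerate(queues)
    ]
    for t in threads:
        t.start()
    producer(queues, num_batches)
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    # Every trainer observes the identical batch stream, produced only once.
    print(run())
```

Because every trainer receives the identical batch stream from one producer, the redundant per-process pipeline work disappears; the trade-off, which the paper addresses, is keeping trainers of different speeds fed without stalling the fastest one.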