Data loaders are used by machine learning (ML) frameworks such as PyTorch and TensorFlow to apply transformations to data before feeding it into the accelerator, an operation known as data preprocessing. Data preprocessing plays an important role in the ML training workflow: if it is inefficiently pipelined with training, it can leave the GPU largely idle, resulting in significant training delays. Unfortunately, existing data loaders waste GPU resources; for example, the PyTorch data loader leaves the GPU idle $76\%$ of the time. One key source of inefficiency is the variability in preprocessing time across samples within the same dataset. Existing data loaders are oblivious to this variability and construct batches without any consideration of slow or fast samples. As a result, an entire batch can be delayed by a single slow sample, stalling the training pipeline and causing head-of-line blocking. To address these inefficiencies, we present MinatoLoader, a general-purpose data loader for PyTorch that accelerates training and improves GPU utilization. MinatoLoader is designed for a single-server setup with multiple GPUs. It continuously prepares data in the background and actively constructs batches by prioritizing fast-to-preprocess samples, while slower samples are processed in parallel. We evaluate MinatoLoader on servers with V100 and A100 GPUs. On a machine with four A100 GPUs, MinatoLoader reduces the training time of a wide range of workloads by up to $7.5\times$ ($3.6\times$ on average) over the PyTorch DataLoader and Pecan, and by up to $3\times$ ($2.2\times$ on average) over DALI. It also increases average GPU utilization from 46.4\% with PyTorch to 90.45\%, while preserving model accuracy and enabling faster convergence.
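To make the batch-construction idea concrete, the following is a minimal Python sketch of completion-order batching, not MinatoLoader's actual implementation; the names \texttt{prioritizing\_batches}, \texttt{preprocess}, and \texttt{num\_workers} are hypothetical. Batches are filled with whichever samples finish preprocessing first, so a slow sample lands in a later batch instead of stalling the current one.

\begin{verbatim}
import concurrent.futures as cf

def prioritizing_batches(samples, preprocess, batch_size, num_workers=8):
    """Illustrative sketch (hypothetical API, not MinatoLoader's):
    yield batches filled in sample-completion order, so a single
    slow-to-preprocess sample cannot head-of-line block a batch."""
    batch = []
    with cf.ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(preprocess, s) for s in samples]
        # as_completed yields futures in completion order, not
        # submission order: fast samples fill earlier batches while
        # slow samples keep preprocessing in the worker pool.
        for fut in cf.as_completed(futures):
            batch.append(fut.result())
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final partial batch

# Usage sketch: samples with highly variable preprocessing cost.
if __name__ == "__main__":
    import random, time

    def preprocess(x):
        time.sleep(random.choice([0.001, 0.1]))  # fast vs. slow sample
        return x

    for b in prioritizing_batches(range(32), preprocess, batch_size=8):
        print(b)  # early batches are dominated by fast samples
\end{verbatim}

A contiguous-batching loader would instead wait for samples in submission order, so each batch inherits the latency of its slowest member; completion-order filling keeps the accelerator fed while slow samples are still in flight.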