This paper presents a comparative analysis of distributed training strategies for large-scale neural networks, focusing on data parallelism, model parallelism, and hybrid approaches. We evaluate these strategies on image classification tasks using the CIFAR-100 dataset, measuring training time, convergence rate, and model accuracy. Our experimental results demonstrate that hybrid parallelism achieves a 3.2x speedup compared to single-device training while maintaining comparable accuracy. We propose an adaptive scheduling algorithm that dynamically switches between parallelism strategies based on network characteristics and available computational resources, resulting in an additional 18% improvement in training efficiency.
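To make the adaptive scheduling idea concrete, the following is a minimal illustrative sketch, not the paper's actual algorithm: it assumes a hypothetical `choose_strategy` heuristic that picks a parallelism strategy from coarse memory and bandwidth estimates. All names, thresholds, and the cost model are assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    DATA_PARALLEL = "data_parallel"
    MODEL_PARALLEL = "model_parallel"
    HYBRID = "hybrid"


@dataclass
class ClusterState:
    num_devices: int
    device_memory_gb: float       # memory available per device
    interconnect_gbps: float      # effective inter-device bandwidth


@dataclass
class ModelProfile:
    param_memory_gb: float        # parameters plus optimizer state
    activation_memory_gb: float   # peak activations at the chosen batch size
    grad_volume_gb: float         # gradient bytes exchanged per step


def choose_strategy(model: ModelProfile, cluster: ClusterState) -> Strategy:
    """Hypothetical heuristic (not the paper's algorithm): use model or hybrid
    parallelism when one replica does not fit on a single device, and prefer
    data parallelism when gradient all-reduce traffic is cheap relative to the
    interconnect."""
    per_device_need = model.param_memory_gb + model.activation_memory_gb
    fits_on_one_device = per_device_need <= cluster.device_memory_gb

    # Rough per-step all-reduce time in seconds (GB * 8 bits / Gbps).
    allreduce_seconds = model.grad_volume_gb * 8.0 / cluster.interconnect_gbps

    if fits_on_one_device and allreduce_seconds < 0.1:
        return Strategy.DATA_PARALLEL
    if not fits_on_one_device and cluster.num_devices <= 4:
        return Strategy.MODEL_PARALLEL
    return Strategy.HYBRID


if __name__ == "__main__":
    cluster = ClusterState(num_devices=8, device_memory_gb=16.0, interconnect_gbps=100.0)
    model = ModelProfile(param_memory_gb=10.0, activation_memory_gb=9.0, grad_volume_gb=2.5)
    print(choose_strategy(model, cluster))  # HYBRID under this toy cost model
```

In this toy cost model the decision is re-evaluated from the current cluster state, so the returned strategy can change as devices or bandwidth become available; the actual switching criteria used in the paper are described in the body of the work.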