This work evaluates the performance of parallelized training and tuning for image classification models and large language models. For image-recognition models, several parallelization strategies are developed for different hardware and software scenarios: simple data parallelism, distributed data parallelism, and distributed processing. Each strategy is described in detail, highlighting the challenges and benefits of its application. In addition, the impact of different dataset types on the tuning of large language models is investigated; experiments show to what extent the task type affects iteration time in a multi-GPU environment, offering insights into data-utilization strategies that improve model performance. The study leverages PyTorch's built-in parallelization mechanisms to facilitate these tasks and incorporates performance profiling to thoroughly evaluate the impact of memory and communication operations during training and tuning. Test scenarios are developed and benchmarked on the NVIDIA H100 architecture, with efficiency reported through selected metrics.
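As an illustration of the PyTorch built-in mechanisms the study relies on, the following is a minimal sketch of distributed data parallelism via `torch.nn.parallel.DistributedDataParallel`. It is not the paper's benchmark code: the tiny linear model, batch shapes, and single-process `gloo` setup are placeholder assumptions so the sketch runs on CPU; in a multi-GPU H100 run each rank would be launched via `torchrun` with the `nccl` backend.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical single-process setup (world_size=1) so the sketch runs anywhere;
# real multi-GPU runs would use torchrun with one rank per device and nccl.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 2)   # stand-in for an image-classification model
ddp_model = DDP(model)          # gradients are all-reduced across ranks

opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
loss = torch.nn.functional.cross_entropy(ddp_model(x), y)
loss.backward()                 # backward pass triggers gradient synchronization
opt.step()
dist.destroy_process_group()
```

The synchronization hook attached by `DDP` during `backward()` is exactly the communication cost the profiling in this study measures alongside memory operations.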