Across the scientific computing landscape, deep learning algorithms have shown excellent performance in a wide range of applications. As deep neural networks (DNNs) continue to mature, the compute required to train them has continued to grow. Today, modern DNNs require millions of FLOPs and days to weeks of training to produce a well-trained model. These training times are often a bottleneck in deep learning research across a variety of applications, so accelerating and scaling DNN training enables faster and more robust experimentation. To that end, in this work we explore using the NRP Nautilus HyperCluster to automate and scale deep learning model training for three separate applications of DNNs: overhead object detection, burned area segmentation, and deforestation detection. In total, 234 deep neural network models are trained on Nautilus, for a total training time of 4,040 hours.
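Because Nautilus is a Kubernetes-based cluster, automated training at this scale amounts to programmatically submitting batch Jobs. The following is a minimal sketch of how one GPU training Job could be submitted with the official Kubernetes Python client; the namespace, container image, and training command are hypothetical placeholders, not the configuration used in this work.

```python
# Minimal sketch: submitting one GPU training job to a Kubernetes
# cluster (such as Nautilus) via the official Python client.
# Assumes a valid kubeconfig with access to the target namespace.
from kubernetes import client, config


def make_training_job(name: str, image: str, command: list[str]) -> client.V1Job:
    """Build a batch/v1 Job that runs a single-GPU training container."""
    container = client.V1Container(
        name="trainer",
        image=image,
        command=command,
        resources=client.V1ResourceRequirements(
            requests={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
            limits={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
        ),
    )
    template = client.V1PodTemplateSpec(
        spec=client.V1PodSpec(containers=[container], restart_policy="Never")
    )
    return client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(template=template, backoff_limit=2),
    )


if __name__ == "__main__":
    config.load_kube_config()  # reads cluster credentials from ~/.kube/config
    batch = client.BatchV1Api()
    # Hypothetical namespace, image, and entry point, for illustration only.
    job = make_training_job(
        name="detector-train-0",
        image="example.org/dl-train:latest",
        command=["python", "train.py", "--epochs", "100"],
    )
    batch.create_namespaced_job(namespace="my-namespace", body=job)
```

Submitting one such Job per model configuration is one way a sweep of hundreds of training runs can be scripted and left to the cluster scheduler.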