Bayesian optimization (BO) is a promising approach for hyperparameter optimization of deep neural networks (DNNs), where each model training run can take minutes to hours. In BO, a computationally cheap surrogate model learns the relationship between hyperparameter configurations and their performance, such as accuracy. Parallel BO methods often adopt a single-manager/multiple-worker strategy to evaluate multiple hyperparameter configurations simultaneously. Even though each hyperparameter evaluation takes significant time, the overhead in such centralized schemes prevents these methods from scaling to a large number of workers. We present an asynchronous-decentralized BO method, wherein each worker runs a sequential BO and asynchronously communicates its results through shared storage. We scale our method to 1,920 parallel workers (the full production queue of the Polaris supercomputer) without loss of computational efficiency, maintaining worker utilization above 95%, and demonstrate improved model accuracy as well as faster convergence on the CANDLE benchmark from the Exascale Computing Project.
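The decentralized scheme described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy `objective` stands in for a DNN training run, threads stand in for workers, a temporary directory of JSON files stands in for the shared storage, and the nearest-neighbor score in `suggest` is a crude substitute for a real surrogate model and acquisition function. All function names are hypothetical.

```python
import json
import os
import random
import tempfile
import threading
import uuid


def objective(x):
    """Toy stand-in for a DNN training run; its maximum is 1.0 at x = 0.3."""
    return 1.0 - (x - 0.3) ** 2


def read_results(store):
    """Load every result that any worker has published so far."""
    results = []
    for name in os.listdir(store):
        if name.endswith(".json"):  # skips in-progress ".json.tmp" files
            with open(os.path.join(store, name)) as f:
                results.append(json.load(f))
    return results


def write_result(store, x, y):
    """Publish one result atomically: write a temp file, then rename it."""
    final = os.path.join(store, uuid.uuid4().hex + ".json")
    tmp = final + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"x": x, "y": y}, f)
    os.replace(tmp, final)


def suggest(results, rng, n_candidates=64, kappa=1.0):
    """Pick the random candidate with the best optimistic score.

    Assumed surrogate: the value of the nearest observed point plus a
    bonus that grows with the distance to it (not a Gaussian process).
    """
    if not results:
        return rng.random()
    best_x, best_score = None, float("-inf")
    for _ in range(n_candidates):
        x = rng.random()
        nearest = min(results, key=lambda r: abs(r["x"] - x))
        score = nearest["y"] + kappa * abs(nearest["x"] - x)
        if score > best_score:
            best_x, best_score = x, score
    return best_x


def worker(store, seed, n_iters):
    """Sequential BO loop with no manager; only shared storage is touched."""
    rng = random.Random(seed)
    for _ in range(n_iters):
        x = suggest(read_results(store), rng)
        write_result(store, x, objective(x))


def run(n_workers=4, n_iters=10):
    """Start the workers, wait for them, and return all published results."""
    with tempfile.TemporaryDirectory() as store:
        threads = [
            threading.Thread(target=worker, args=(store, seed, n_iters))
            for seed in range(n_workers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return read_results(store)
```

Calling `run(n_workers=4, n_iters=10)` returns 40 evaluated configurations. No worker ever waits for a manager or for another worker: each one reads whatever results exist at that moment, chooses its next point, and publishes its own result, which is the property the abstract credits for high worker utilization.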