Recent years have seen many successful applications of machine learning (ML) to fluid dynamics computations. As simulations grow in scale, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks. Additionally, performing inference at runtime requires non-trivial coupling of ML framework libraries with simulation codes. This work addresses both limitations by simplifying this coupling and enabling in situ training and inference workflows on heterogeneous clusters. Leveraging SmartSim, the presented framework deploys a database to store data and ML models in memory, thus circumventing the file system. On the Polaris supercomputer, we demonstrate perfect scaling efficiency of the data transfer and inference costs up to the full machine size, thanks to a novel co-located deployment of the database. Moreover, we train an autoencoder in situ from a turbulent flow simulation, showing that the framework overhead is negligible relative to a solver time step and training epoch.