Recent years have seen a surge in deep learning approaches to accelerate numerical solvers, which provide faithful but computationally intensive simulations of the physical world. These deep surrogates are generally trained in a supervised manner from limited amounts of data, slowly generated by the same solver they are intended to accelerate. We propose an open-source framework that enables the online training of these models from a large ensemble run of simulations. It leverages multiple levels of parallelism to generate rich datasets. The framework avoids I/O bottlenecks and storage issues by directly streaming the generated data. A training reservoir mitigates the inherent bias of streaming while maximizing GPU throughput. An experiment training a fully connected network as a surrogate for the heat equation shows that the proposed approach enables training on 8 TB of data in 2 hours, with accuracy improved by 47% and batch throughput multiplied by 13 compared to a traditional offline procedure.
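To make the role of the training reservoir concrete, the following is a minimal sketch, not the framework's actual implementation. It assumes a bounded buffer that evicts a random entry when full and serves batches drawn uniformly at random with replacement; the class name `TrainingReservoir`, its methods, and the eviction policy are all hypothetical choices made for illustration.

```python
import random


class TrainingReservoir:
    """Bounded buffer that mixes streamed samples before batching.

    Hypothetical sketch: the names and the eviction policy are assumptions,
    not the API of the framework described above.
    """

    def __init__(self, capacity, seed=0):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items = []
        self.rng = random.Random(seed)

    def put(self, sample):
        """Store a streamed sample, overwriting a random entry when full."""
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            self.items[self.rng.randrange(self.capacity)] = sample

    def sample_batch(self, batch_size):
        """Draw a batch uniformly at random, with replacement."""
        if not self.items:
            raise RuntimeError("reservoir is empty")
        return [self.rng.choice(self.items) for _ in range(batch_size)]


def demo():
    reservoir = TrainingReservoir(capacity=8)
    # Samples arrive in simulation order: (simulation_id, time_step).
    for sim_id in range(4):
        for step in range(5):
            reservoir.put((sim_id, step))
    return reservoir, reservoir.sample_batch(4)
```

Under these assumptions, random eviction lets older samples coexist with recent ones, so a batch is less dominated by whichever simulation streamed last, and sampling with replacement means the training loop can always draw a full batch without waiting for new data.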