Applications that fuse machine learning and simulation can benefit from using multiple computing resources, for example by running simulation codes on highly parallel supercomputers and AI training and inference tasks on specialized accelerators. Here, we present our experiences deploying two AI-guided simulation workflows across such heterogeneous systems. A unique aspect of our approach is our use of cloud-hosted management services to handle challenging aspects of cross-resource authentication and authorization, function-as-a-service (FaaS) function invocation, and data transfer. We show that these methods can achieve performance parity with systems that rely on direct connections between resources. We achieve parity by integrating the FaaS system and data transfer capabilities with a system that passes data by reference among managers and workers, and with a user-configurable steering algorithm that hides data transfer latencies. We anticipate that the ease of use of this approach can enable routine use of heterogeneous resources in computational science.
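The idea of passing data by reference among managers and workers can be illustrated with a minimal Python sketch. It is not the authors' implementation: `InMemoryStore`, `Reference`, and `worker_task` are hypothetical names, and the in-process dictionary stands in for whatever data transfer service moves bytes between resources. The point it shows is that only a small key travels with the task request, while the payload is fetched by the worker when it is needed.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
import uuid


class InMemoryStore:
    """Hypothetical stand-in for a data transfer backend shared by all parties."""

    def __init__(self):
        self._data = {}
        self.fetches = 0  # counts how often a payload is actually retrieved

    def put(self, obj) -> str:
        key = uuid.uuid4().hex
        self._data[key] = obj
        return key

    def get(self, key: str):
        self.fetches += 1
        return self._data[key]


@dataclass(frozen=True)
class Reference:
    """Lightweight handle sent in place of the payload itself."""

    key: str

    def resolve(self, store: InMemoryStore):
        return store.get(self.key)


def worker_task(ref: Reference, store: InMemoryStore) -> int:
    # The worker resolves the reference only when it needs the data.
    data = ref.resolve(store)
    return sum(data)


store = InMemoryStore()
payload = list(range(1000))          # assumed example input
ref = Reference(store.put(payload))  # manager stores the data, keeps a key

# The thread pool plays the role of a remote worker in this sketch.
with ThreadPoolExecutor(max_workers=1) as pool:
    result = pool.submit(worker_task, ref, store).result()
```

In a real deployment the store would be backed by a transfer service, and a steering policy could start transfers before the results are needed, so that the transfer latency overlaps with other work.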