Computational workflows are a common class of application on supercomputers, yet their loosely coupled and heterogeneous nature often prevents them from taking full advantage of a supercomputer's capabilities. We created Colmena to leverage the massive parallelism of a supercomputer by using Artificial Intelligence (AI) to learn from and adapt a workflow as it executes. Colmena allows scientists to define how their application should respond to events (e.g., task completion) as a series of cooperative agents. In this paper, we describe the design of Colmena, the challenges we overcame while deploying applications on exascale systems, and the science workflows we have enhanced by interweaving AI. The scaling challenges we discuss include developing steering strategies that maximize node utilization, introducing data fabrics that reduce the communication overhead of data-intensive tasks, and implementing workflow tasks that cache costly operations between invocations. These innovations, coupled with a variety of application patterns accessible through our agent-based steering model, have enabled science advances in chemistry, biophysics, and materials science using different types of AI. Our vision is that Colmena will spur creative solutions that harness AI across many domains of scientific computing.