Employing Artificial Intelligence to Steer Exascale Workflows with Colmena

Computational workflows are a common class of application on supercomputers, yet loosely coupled, heterogeneous workflows often fail to take full advantage of these machines' capabilities. We created Colmena to leverage the massive parallelism of a supercomputer by using Artificial Intelligence (AI) to learn from and adapt a workflow as it executes. Colmena allows scientists to define how their application should respond to events (e.g., task completion) as a series of cooperative agents. In this paper, we describe the design of Colmena, the challenges we overcame while deploying applications on exascale systems, and the science workflows we have enhanced by interweaving AI. The scaling challenges we discuss include developing steering strategies that maximize node utilization, introducing data fabrics that reduce the communication overhead of data-intensive tasks, and implementing workflow tasks that cache costly operations between invocations. These innovations, coupled with a variety of application patterns accessible through our agent-based steering model, have enabled science advances in chemistry, biophysics, and materials science using different types of AI. Our vision is that Colmena will spur creative solutions that harness AI across many domains of scientific computing.
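
The agent-based steering idea described in the abstract can be illustrated with a small sketch. This is not Colmena's API: it uses only the Python standard library, and every name in it (`expensive_setup`, `simulate`, `Steering`, `submitter`, `receiver`) is hypothetical. It assumes two cooperating agents, one that submits a task whenever a worker slot is free and one that reacts to task-completion events, plus a task that caches a costly setup step between invocations.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=None)
def expensive_setup(name: str) -> dict:
    """Hypothetical costly step (e.g. loading a model); cached so later calls reuse it."""
    return {"name": name, "scale": 2.0}


def simulate(x: float) -> float:
    """Hypothetical task: scores a candidate, reusing the cached setup on each call."""
    model = expensive_setup("toy-model")
    return -model["scale"] * (x - 3.0) ** 2


class Steering:
    """Two cooperating agents share state: one submits tasks, one reacts to results."""

    def __init__(self, candidates, n_slots):
        self.candidates = list(candidates)
        self.slots = threading.Semaphore(n_slots)  # one permit per free worker
        self.results = queue.Queue()               # "task completed" events
        self.pool = ThreadPoolExecutor(max_workers=n_slots)
        self.best = None                           # (score, x) of the best result so far
        self.n_done = 0

    def submitter(self):
        """Agent 1: submit a new task as soon as a worker slot frees up."""
        for x in self.candidates:
            self.slots.acquire()
            future = self.pool.submit(simulate, x)
            # Turn each completion into an event that the receiver agent consumes.
            future.add_done_callback(lambda f, x=x: self.results.put((x, f.result())))

    def receiver(self):
        """Agent 2: respond to each completion event, then free the worker slot."""
        for _ in range(len(self.candidates)):
            x, score = self.results.get()
            if self.best is None or score > self.best[0]:
                self.best = (score, x)
            self.n_done += 1
            self.slots.release()

    def run(self):
        agents = [threading.Thread(target=self.submitter),
                  threading.Thread(target=self.receiver)]
        for agent in agents:
            agent.start()
        for agent in agents:
            agent.join()
        self.pool.shutdown()
        return self.best


if __name__ == "__main__":
    steering = Steering(candidates=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0], n_slots=2)
    score, x = steering.run()
    print(f"best candidate: x={x}, score={score}")
```

The semaphore keeps both worker slots busy without oversubscribing them, which is a toy version of maximizing node utilization; a real steering policy would also use the incoming results to decide which task to submit next.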

