Constructing datasets representative of the target domain is essential for training effective machine learning models. Active learning (AL) is a promising method that iteratively extends training data to enhance model performance while minimizing data acquisition costs. However, current AL workflows often require human intervention and lack parallelism, leading to inefficiencies and underutilization of modern computational resources. In this work, we introduce PAL, an automated, modular, and parallel active learning library that integrates AL tasks and manages their execution and communication on shared- and distributed-memory systems using the Message Passing Interface (MPI). PAL provides users with the flexibility to design and customize all components of their active learning scenarios, including machine learning models with uncertainty estimation, oracles for ground-truth labeling, and strategies for exploring the target space. We demonstrate that PAL significantly reduces computational overhead and improves scalability, achieving substantial speed-ups through asynchronous parallelization on CPU and GPU hardware. Applications of PAL to several real-world scenarios, including ground-state reactions in biomolecular systems, excited-state dynamics of molecules, simulations of inorganic clusters, and thermo-fluid dynamics, illustrate its effectiveness in accelerating the development of machine learning models. Our results show that PAL enables efficient utilization of high-performance computing resources in active learning workflows, fostering advancements in scientific research and engineering applications.
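To make the three components named in the abstract concrete (a model with uncertainty estimation, an oracle, and an exploration strategy), the following is a minimal serial sketch of a generic active learning loop. It is not PAL's API and does not use MPI. All names (`oracle`, `train_committee`, `committee_predict`, `active_learning`), the toy target `sin(x)`, the nearest-neighbour committee, and the threshold value are assumptions chosen only for illustration. Uncertainty is estimated here as the disagreement (standard deviation) among committee members trained on different subsets of the labeled data.

```python
import math
import random
import statistics


def oracle(x):
    """Hypothetical stand-in for an expensive ground-truth calculation."""
    return math.sin(x)


def train_committee(labeled, n_members):
    """Build a committee; member i leaves out every point whose index k
    satisfies k % n_members == i, so members see different subsets."""
    return [
        [p for k, p in enumerate(labeled) if k % n_members != i]
        for i in range(n_members)
    ]


def committee_predict(committee, x):
    """Each member predicts the label of its nearest training point."""
    return [min(member, key=lambda p: abs(p[0] - x))[1] for member in committee]


def active_learning(n_steps=200, threshold=0.02, n_members=5, seed=0):
    rng = random.Random(seed)
    # Small initial training set with three distinct labels.
    labeled = [(x, oracle(x)) for x in (0.5, 2.5, 4.5)]
    committee = train_committee(labeled, n_members)
    oracle_calls = 0
    for _ in range(n_steps):
        # Exploration: propose a random candidate in the target space.
        x = rng.uniform(0.0, 2.0 * math.pi)
        # Uncertainty: disagreement among committee predictions.
        uncertainty = statistics.pstdev(committee_predict(committee, x))
        if uncertainty > threshold:
            # Only uncertain candidates are sent to the oracle.
            labeled.append((x, oracle(x)))
            oracle_calls += 1
            # Retrain on the extended training set.
            committee = train_committee(labeled, n_members)
    return labeled, oracle_calls


if __name__ == "__main__":
    data, calls = active_learning()
    print(f"oracle calls: {calls}, training set size: {len(data)}")
```

In this sketch, exploration, prediction, labeling, and retraining run one after another in a single process, so each step waits for the previous one. The abstract describes PAL as running such tasks in parallel and asynchronously, with communication handled via MPI.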