Creating new materials, discovering new drugs, and simulating systems are essential processes for research and innovation, and they require substantial computational power. While many applications can be split into smaller independent tasks, some cannot, and may take from hours to weeks to run to completion. To better manage those longer-running jobs, it would be desirable to stop them at an arbitrary point in time and later continue their computation on another compute resource; this is usually referred to as checkpointing. While some applications can manage checkpointing programmatically, it would be preferable if the batch scheduling system could do so independently of the application. This paper evaluates the feasibility of using CRIU (Checkpoint/Restore In Userspace), an open-source tool for GNU/Linux environments, with an emphasis on the OSG's OSPool HTCondor setup. CRIU checkpoints the state of a process into a disk image and can deal seamlessly with both open files and established network connections. Furthermore, it can checkpoint both traditional Linux processes and containerized workloads. The functionality seems adequate for many scenarios supported in the OSPool. However, some limitations prevent it from being usable in all circumstances.
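As a rough illustration of the checkpoint-and-restore workflow described above, the following Python sketch assembles `criu dump` and `criu restore` command lines. It is a minimal sketch under stated assumptions, not the setup evaluated in this paper: the helper names, the PID, and the image directory are hypothetical, and the sketch only builds the commands without executing them, since running CRIU typically requires elevated privileges and a suitably configured kernel.

```python
import shlex
from typing import List


def criu_dump_cmd(pid: int, image_dir: str,
                  tcp_established: bool = True,
                  shell_job: bool = False) -> List[str]:
    """Build a `criu dump` command that checkpoints the process tree rooted at pid.

    Hypothetical helper for illustration; it does not run CRIU.
    """
    cmd = ["criu", "dump", "--tree", str(pid), "--images-dir", image_dir]
    if tcp_established:
        # Ask CRIU to checkpoint established TCP connections as well.
        cmd.append("--tcp-established")
    if shell_job:
        # Needed when the process was started from an interactive shell.
        cmd.append("--shell-job")
    return cmd


def criu_restore_cmd(image_dir: str,
                     tcp_established: bool = True,
                     shell_job: bool = False) -> List[str]:
    """Build the matching `criu restore` command for a previously written image."""
    cmd = ["criu", "restore", "--images-dir", image_dir]
    if tcp_established:
        cmd.append("--tcp-established")
    if shell_job:
        cmd.append("--shell-job")
    return cmd


if __name__ == "__main__":
    # Assumed values: PID 1234 and /tmp/ckpt are placeholders.
    print(shlex.join(criu_dump_cmd(1234, "/tmp/ckpt")))
    print(shlex.join(criu_restore_cmd("/tmp/ckpt")))
```

In a batch-system setting, the image directory would have to be transferred to the new compute resource between the two steps; how that transfer is done is outside the scope of this sketch.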