A popular approach to deploying scientific applications in high performance computing (HPC) is Linux containers, which package an application and all its dependencies as a single unit. This image is built by interpreting instructions in a machine-readable recipe, which is faster with a build cache that stores instruction results for re-use. The standard approach (used e.g. by Docker and Podman) is a many-layered union filesystem, encoding differences between layers as tar archives. Our experiments show this performs similarly to layered caches on both build time and disk usage, with a considerable advantage for many-instruction recipes. Our approach also has structural advantages: better diff format, lower cache overhead, and better file de-duplication. These results show that a Git-based cache for layer-free container implementations is not only possible but may outperform the layered approach on important dimensions.
翻译:高性能计算(HPC)领域部署科学应用的常用方法是Linux容器——将应用程序及其所有依赖项打包为单一单元。该镜像通过解释机器可读的构建指令生成,而构建缓存通过存储指令结果实现复用,能够显著加速构建过程。主流方案(如Docker和Podman采用的)是基于多层联合文件系统,将层间差异编码为tar归档文件。我们通过实验证明,本方法在构建时间与磁盘占用方面与分层缓存性能相当,且对多指令构建配方具有显著优势。此外,本方法在结构上具备以下优势:更优的差异格式、更低的缓存开销以及更强的文件去重能力。研究结果表明,面向无层容器实现的Git缓存方案不仅可行,更可能在关键维度上超越分层方案。