Retrieving data from large-scale source code archives is vital for AI training, neural-based software analysis, and information retrieval, to name a few applications. This paper studies and experiments with the design of a compressed key-value store for indexing large-scale source code datasets, evaluating its trade-offs among three primary computational resources: (compressed) space occupancy, time, and energy efficiency. Extensive experiments on a national high-performance computing infrastructure demonstrate that different compression configurations yield distinct trade-offs, with high compression ratios and order-of-magnitude gains in retrieval throughput and energy efficiency. We also study data parallelism and show that, while it significantly improves speed, scaling energy efficiency is more difficult, reflecting the known non-energy-proportionality of modern hardware and challenging the assumption of a direct time-energy correlation. This work streamlines automation in energy-aware configuration tuning and standardized green benchmarking deployable in CI/CD pipelines, thus empowering system architects with a spectrum of Pareto-optimal energy-compression-throughput trade-offs and actionable guidelines for building sustainable, efficient storage backends for massive open-source code archival.