Progress in High-Performance Computing in general, and High-Performance Graph Processing in particular, depends heavily on the availability of publicly accessible, relevant, and realistic datasets. To sustain this progress, we (i) investigate and optimize the process of generating large sequence similarity graphs as an HPC challenge and (ii) demonstrate this process by creating MS-BioGraphs, a new family of publicly available, real-world, edge-weighted graph datasets with up to $2.5$ trillion edges, that is, $6.6$ times larger than the largest recently published graph. The largest graph is created by matching (i.e., all-to-all similarity aligning) $1.7$ billion protein sequences. The MS-BioGraphs family also includes seven subgraphs of different sizes and direction types. We describe the two main challenges we faced in generating large graph datasets and our solutions to them, namely (i) optimizing data structures and algorithms for this multi-step process and (ii) the WebGraph parallel compression technique. We present a comparative study of the structural characteristics of MS-BioGraphs. The datasets are available online at https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs.