Similarity search is one of the most fundamental computations regularly performed on ever-growing protein datasets. Scalability is of paramount importance for uncovering novel phenomena that occur only at very large scales. We unleash the power of over 20,000 GPUs on the Summit system to perform all-vs-all protein similarity search on one of the largest publicly available datasets, containing 405 million proteins, in less than 3.5 hours, cutting the time-to-solution for many use cases from weeks to hours. The variability of protein sequence lengths, together with the sparsity of the space of pairwise comparisons, makes this a challenging problem in distributed memory. Because the application must construct and maintain a data structure holding indices to all other sequences, it has a huge memory footprint that makes it hard to scale to larger problem sizes. We overcome this memory limitation with innovative matrix-based blocking techniques, without introducing additional load imbalance.
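To make the blocking idea concrete, the following is a minimal single-process sketch, not the method used in this work. It assumes that candidate pairs are found by counting shared k-mers, which the text above does not state, and it omits alignment, GPUs, and distribution entirely. The names `kmers` and `candidate_pairs_blocked` and the parameters `k`, `block_size`, and `min_shared` are hypothetical. The point it illustrates is that the sparse pairwise comparison matrix can be produced one block at a time, so that only one block's index and counts are held in memory at once.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_pairs_blocked(seqs, k=3, block_size=2, min_shared=1):
    """Yield (i, j, shared) for i < j, one block of the pair matrix at a time.

    Assumption for illustration: two sequences are candidates when they
    share at least `min_shared` distinct k-mers.
    """
    n = len(seqs)
    # For simplicity the sketch keeps all k-mer sets in memory; only the
    # per-block index and pair counts are blocked.
    kmer_sets = [kmers(s, k) for s in seqs]
    starts = range(0, n, block_size)
    for r0 in starts:
        # Index only the sequences of the current row block: k-mer -> row ids.
        index = defaultdict(list)
        for i in range(r0, min(r0 + block_size, n)):
            for km in kmer_sets[i]:
                index[km].append(i)
        for c0 in starts:
            if c0 < r0:
                continue  # the pair matrix is symmetric; visit upper blocks only
            counts = defaultdict(int)  # sparse counts for this block only
            for j in range(c0, min(c0 + block_size, n)):
                for km in kmer_sets[j]:
                    for i in index.get(km, ()):
                        if i < j:
                            counts[(i, j)] += 1
            for (i, j), shared in sorted(counts.items()):
                if shared >= min_shared:
                    yield i, j, shared


if __name__ == "__main__":
    toy = ["MKVLAA", "KVLAAG", "GGGGGG", "MKVLGG"]  # made-up sequences
    for i, j, shared in candidate_pairs_blocked(toy, k=3, block_size=2):
        print(i, j, shared)
```

The block size trades memory for the number of passes: smaller blocks lower the peak footprint but revisit the column sequences more often, and the set of reported pairs does not depend on the block size.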