Gene sequence search is a fundamental operation in computational genomics. Because genome archives have reached petabyte scale, most gene search systems now use hashing-based data structures such as Bloom Filters (BF). State-of-the-art systems such as Compact bit-slicing signature index (COBS) and Repeated And Merged Bloom filters (RAMBO) use BF with Random Hash (RH) functions for gene representation and identification. The standard recipe is to cast gene search as a sequence of membership problems: for a query sequence Q, test whether each successive substring of Q (called a kmer) is present in the set of kmers of the entire gene database D. We observe that RH functions, which are crucial to the memory and computational advantages of BF, are also detrimental to the system performance of gene search systems. Consecutive kmers in a query are likely to be very similar, yet RH, oblivious to any similarity, distributes them uniformly across different parts of a potentially large BF, triggering excessive cache misses and slowing the system down. We propose a novel hash function family called Identity with Locality (IDL), which co-locates keys that are close in input space without causing them to collide. This approach ensures both cache locality and key preservation. IDL functions can be a drop-in replacement for RH functions and help improve the performance of information retrieval systems. We give a simple but practical construction of IDL function families and show that replacing RH with IDL functions reduces cache misses by a factor of 5, improving the query and indexing times of state-of-the-art methods such as COBS and RAMBO by factors of up to 2 without compromising their quality. We also provide a theoretical analysis of the false positive rate of BF with IDL functions. Ours is the first study that bridges Locality Sensitive Hash (LSH) and RH to obtain cache efficiency.
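The membership-query recipe and the locality problem described above can be illustrated with a minimal Python sketch. It is not the construction from this work: the two-level scheme in `local_position` (a min-hashed sub-kmer picks a cache-line-sized block, and a random hash of the full kmer picks the bit inside that block) is only one plausible way to co-locate similar kmers while keeping them distinct. All names and parameters (`K`, `T`, `M_BITS`, `BLOCK_BITS`, `rh_position`, `local_position`, `block_switches`) are assumptions made for illustration.

```python
import hashlib

K = 31                 # kmer length (assumed)
T = 10                 # sub-kmer length for the locality step (assumed)
M_BITS = 1 << 20       # Bloom filter size in bits (assumed)
BLOCK_BITS = 512       # one 64-byte cache line (assumed)
NUM_BLOCKS = M_BITS // BLOCK_BITS


def kmers(seq, k=K):
    """All length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def h(s, seed):
    """Seeded 64-bit hash; stands in for a random hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def rh_position(kmer, seed=0):
    """Random hash: every kmer lands anywhere in the filter."""
    return h(kmer, 3 * seed) % M_BITS


def local_position(kmer, seed=0):
    """Hypothetical locality-aware hash (illustrative, not the IDL construction).

    The sub-kmer with the smallest hash value selects the block; it tends to
    stay the same across consecutive kmers. The full kmer selects the bit
    inside the block, so distinct kmers still spread out within it.
    """
    subs = [kmer[i:i + T] for i in range(len(kmer) - T + 1)]
    anchor = min(subs, key=lambda s: h(s, 3 * seed))
    block = h(anchor, 3 * seed + 1) % NUM_BLOCKS
    offset = h(kmer, 3 * seed + 2) % BLOCK_BITS
    return block * BLOCK_BITS + offset


class BloomFilter:
    def __init__(self, position_fn, num_hashes=2):
        self.bits = bytearray(M_BITS // 8)
        self.position_fn = position_fn
        self.num_hashes = num_hashes

    def add(self, kmer):
        for seed in range(self.num_hashes):
            p = self.position_fn(kmer, seed)
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, kmer):
        return all(
            (self.bits[p >> 3] >> (p & 7)) & 1
            for p in (self.position_fn(kmer, s) for s in range(self.num_hashes))
        )


def contains_sequence(bf, query):
    """Gene search as a sequence of kmer membership tests."""
    return all(km in bf for km in kmers(query))


def block_switches(position_fn, seq):
    """How often consecutive kmers land in different blocks (a cache-miss proxy)."""
    blocks = [position_fn(km) // BLOCK_BITS for km in kmers(seq)]
    return sum(a != b for a, b in zip(blocks, blocks[1:]))
```

Calling `block_switches(rh_position, seq)` and `block_switches(local_position, seq)` on the same sequence gives a rough feel for the effect: under the random hash nearly every consecutive kmer lands in a different block, whereas the locality-aware variant changes block only when the selected sub-kmer changes. Block switches are only a proxy for cache misses, not a measurement of them.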