Vigemers: on the number of $k$-mers sharing the same XOR-based minimizer

In bioinformatics, minimizers have become an indispensable tool for handling $k$-mers (words of fixed size $k$) extracted from DNA or RNA sequencing data, whether for sampling, storage, querying or partitioning. Given a fixed order on $m$-mers ($m<k$), the minimizer of a $k$-mer is defined as its smallest $m$-mer, which acts as its fingerprint. Although minimizers are widely used for partitioning, there is almost no theoretical work on the quality of the resulting partitions. For instance, it has been known for decades that the lexicographic order empirically leads to highly unbalanced partitions that are unusable in practice, but this observation was theoretically substantiated only very recently. The rejection of the lexicographic order has led the community to resort to (pseudo-)random orders based on hash functions. In this work, we extend the theoretical results on the partitions obtained with the lexicographic order to an (exponentially) large family of hash functions, namely those in which each $m$-mer is XORed with a fixed key. More precisely, given a key $γ$ and an $m$-mer $w$, we investigate the function that counts how many $k$-mers admit $w$ as their minimizer (i.e. for which $w \oplus γ$ is minimal among the values $x \oplus γ$ taken over all $m$-mers $x$ of the $k$-mer). This number, denoted by $π_k^γ(w)$, is the maximum size of the bucket associated with $w$, reached if all possible $k$-mers were seen and partitioned. We adapt the method developed in the literature for the lexicographic order to our framework and propose combinatorial equations that make it possible to compute $π_k^γ(w)$ by dynamic programming in $O(km^2)$ time and $O(km)$ space.
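
To make the definition of $π_k^γ(w)$ concrete, the following is a minimal brute-force sketch, not the dynamic programming method described above. It enumerates all $4^k$ possible $k$-mers and is therefore only usable for very small $k$. The 2-bit encoding (A=0, C=1, G=2, T=3), the representation of the key $γ$ as a word of length $m$, and the function names `encode`, `xor_minimizer` and `count_kmers_brute_force` are assumptions made for illustration.

```python
from itertools import product

# Assumed 2-bit nucleotide encoding; the order A < C < G < T then matches
# the lexicographic order on fixed-length words.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(word):
    """Pack a DNA word into an integer, two bits per letter."""
    value = 0
    for letter in word:
        value = (value << 2) | ENCODE[letter]
    return value


def xor_minimizer(kmer, m, gamma):
    """Return the m-mer x of kmer whose value encode(x) XOR encode(gamma) is smallest.

    gamma is assumed to be a word of length m. XOR with a fixed key is a
    bijection, so two distinct m-mers never tie.
    """
    key = encode(gamma)
    mmers = (kmer[i:i + m] for i in range(len(kmer) - m + 1))
    return min(mmers, key=lambda x: encode(x) ^ key)


def count_kmers_brute_force(w, k, gamma):
    """Count the k-mers whose XOR-based minimizer is w, by exhaustive enumeration."""
    m = len(w)
    return sum(
        1
        for letters in product("ACGT", repeat=k)
        if xor_minimizer("".join(letters), m, gamma) == w
    )
```

With the all-`A` key (value 0) the order reduces to the lexicographic one: `count_kmers_brute_force("AA", 3, "AA")` counts the 3-mers containing `AA`. Summing the counts over all $4^m$ possible $m$-mers gives $4^k$, since every $k$-mer has exactly one minimizer.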
