Minimizers and convolutional neural networks (CNNs) are two popular but quite distinct techniques that have both been employed to analyze categorical biological sequences. At face value, the methods seem entirely dissimilar. Minimizers use min-wise hashing on a rolling window to extract a single important k-mer feature per window. CNNs start with a wide array of randomly initialized convolutional filters, paired with a pooling operation, and then use multiple additional neural layers to learn both the filters themselves and how they can be used to classify the sequence. Here, our main result is a careful mathematical analysis of hash function properties showing that, for sequences over a categorical alphabet, random Gaussian initialization of convolutional filters with max-pooling is equivalent to choosing a minimizer ordering such that selected k-mers are (in Hamming distance) far from the k-mers within the sequence but close to other minimizers. In empirical experiments, we find that this property manifests as decreased density in repetitive regions, both in simulation and on real human telomeres. We additionally train from scratch a CNN embedding of synthetic short reads from the SARS-CoV-2 genome into 3D Euclidean space that locally recapitulates the linear sequence distance of the read origins. This is a modest step towards building a deep learning assembler, though it is at present too slow to be practical. In total, this manuscript provides a partial explanation for the effectiveness of CNNs in categorical sequence analysis.
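To make the stated correspondence concrete, the following is a minimal sketch (not the authors' code) of why a randomly initialized convolutional filter followed by max-pooling behaves like a minimizer scheme. It assumes a DNA alphabet, a single filter of width k, stride-1 convolution, and stride-1 max-pooling over w consecutive responses. All function names and parameters here are hypothetical and chosen for illustration only. The filter response of a one-hot encoded k-mer is treated as its score, and the minimizer ordering is taken to be the negated score, so "largest response in the pool" and "smallest k-mer in the window" pick the same position.

```python
import random

ALPHABET = "ACGT"  # assumption: DNA; any categorical alphabet works the same way


def gaussian_filter(k, rng):
    # One Gaussian weight per (position, letter): a k x |alphabet| filter.
    return [{c: rng.gauss(0.0, 1.0) for c in ALPHABET} for _ in range(k)]


def filter_response(kmer, filt):
    # Dot product of the filter with the one-hot encoding of the k-mer:
    # only the weight of the observed letter at each position contributes.
    return sum(filt[i][c] for i, c in enumerate(kmer))


def minimizer_positions(seq, k, w, key):
    # Window minimizers: for every window of w consecutive k-mers, keep the
    # start position of the k-mer with the smallest key (leftmost on ties).
    n_kmers = len(seq) - k + 1
    picked = []
    for start in range(n_kmers - w + 1):
        best = min(range(start, start + w), key=lambda i: key(seq[i:i + k]))
        picked.append(best)
    return picked


def maxpool_argmax_positions(seq, k, w, filt):
    # Stride-1 convolution, then stride-1 max-pooling over w responses,
    # recording which position produced each pooled maximum.
    responses = [filter_response(seq[i:i + k], filt)
                 for i in range(len(seq) - k + 1)]
    picked = []
    for start in range(len(responses) - w + 1):
        window = responses[start:start + w]
        picked.append(start + window.index(max(window)))
    return picked


if __name__ == "__main__":
    rng = random.Random(0)
    seq = "".join(rng.choice(ALPHABET) for _ in range(200))
    filt = gaussian_filter(k=5, rng=rng)
    as_minimizer = minimizer_positions(
        seq, 5, 10, key=lambda kmer: -filter_response(kmer, filt))
    as_cnn = maxpool_argmax_positions(seq, 5, 10, filt)
    print(as_minimizer == as_cnn)
    # Density: fraction of k-mer positions selected at least once.
    print(len(set(as_cnn)) / (len(seq) - 5 + 1))
```

The sketch only shows that the two selection rules coincide position by position. It does not demonstrate the Hamming-distance properties of the induced ordering, nor the reduced density in repetitive regions, which are the subject of the analysis and experiments described above.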