Minimizers and convolutional neural networks (CNNs) are two distinct, popular techniques that have both been used to analyze categorical biological sequences. At face value, the methods seem entirely dissimilar. Minimizers use min-wise hashing on a rolling window to extract a single important k-mer feature per window. CNNs start with a wide array of randomly initialized convolutional filters, paired with a pooling operation, followed by multiple additional neural layers that learn both the filters themselves and how they can be used to classify the sequence. Our main result is a careful mathematical analysis of hash function properties showing that, for sequences over a categorical alphabet, random Gaussian initialization of convolutional filters with max-pooling is equivalent to choosing a minimizer ordering such that selected k-mers are (in Hamming distance) far from the k-mers within the sequence but close to other minimizers. In empirical experiments, we find that this property manifests as decreased density in repetitive regions, both in simulation and on real human telomeres. We additionally train from scratch a CNN embedding of synthetic short reads from the SARS-CoV-2 genome into 3D Euclidean space that locally recapitulates the linear sequence distance of the read origins. This is a modest step towards building a deep learning assembler, though it is at present too slow to be practical. In total, this manuscript provides a partial explanation for the effectiveness of CNNs in categorical sequence analysis.
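To make the correspondence between the two methods concrete, the following is a minimal sketch, not the paper's implementation. It assumes a single convolutional filter with independent standard-normal weights, one-hot encoded DNA input, and max-pooling over windows of w consecutive k-mers. Under those assumptions, the pooled output of each window comes from the k-mer with the largest filter response, so negating the response gives a minimizer ordering. The function names (`minimizers`, `gaussian_filter_order`, `density`), the leftmost tie-breaking rule, the parameter values, and the telomere-like test string are all illustrative choices rather than details taken from the manuscript.

```python
import random

ALPHABET = "ACGT"

def minimizers(seq, k, w, order):
    """Return sorted start positions of the k-mers picked as window minimizers.

    Each window spans w consecutive k-mers; the k-mer with the smallest
    order(kmer) is kept, with ties broken by the leftmost position.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    keys = [order(km) for km in kmers]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        picked.add(min(window, key=lambda i: (keys[i], i)))
    return sorted(picked)

def gaussian_filter_order(k, rng):
    """Build a k-mer ordering from one random Gaussian convolutional filter.

    The filter holds one weight per (position, letter). On one-hot input the
    convolution output for a k-mer is the sum of the weights it selects, so
    max-pooling over a window keeps the k-mer with the largest score.
    Negating the score turns that choice into a minimizer ordering.
    """
    weights = [{c: rng.gauss(0.0, 1.0) for c in ALPHABET} for _ in range(k)]

    def order(kmer):
        return -sum(weights[j][c] for j, c in enumerate(kmer))

    return order

def density(seq, k, w, order):
    """Fraction of k-mer positions that are selected as minimizers."""
    return len(minimizers(seq, k, w, order)) / (len(seq) - k + 1)

rng = random.Random(0)
random_seq = "".join(rng.choice(ALPHABET) for _ in range(2000))
repeat_seq = "TTAGGG" * 334  # telomere-like repeat, for illustration only
order = gaussian_filter_order(k=8, rng=rng)
print("density on random sequence:", density(random_seq, 8, 10, order))
print("density on repeat sequence:", density(repeat_seq, 8, 10, order))
```

Passing a different `order` callable, such as the identity function for lexicographic ordering, runs the same selection loop under a conventional minimizer scheme, which allows a side-by-side density comparison on the same input. The sketch uses a single filter and a toy sequence, so its numbers should not be read as a reproduction of the reported experiments.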