Efficient computation or approximation of Levenshtein distance, a widely used metric for evaluating sequence similarity, has attracted significant attention with the emergence of DNA storage and other biological applications. Sequence embedding, which maps Levenshtein distance to a conventional distance between embedding vectors, has emerged as a promising solution. In this paper, we propose a novel neural network-based sequence embedding technique that uses Poisson regression. We first provide a theoretical analysis of the impact of embedding dimension on model performance and present a criterion for selecting an appropriate embedding dimension. Under this embedding dimension, Poisson regression is introduced by assuming that the Levenshtein distance between sequences of fixed length follows a Poisson distribution, which aligns naturally with the definition of Levenshtein distance as a count of edit operations. Moreover, from the perspective of the distribution of embedding distances, Poisson regression approximates the negative log-likelihood of the chi-squared distribution and helps remove the skewness. Through comprehensive experiments on real DNA storage data, we demonstrate that the proposed method outperforms state-of-the-art approaches.
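To make the ingredients above concrete, the following is a minimal sketch, not the paper's implementation. It computes the exact Levenshtein distance by dynamic programming and evaluates a Poisson negative log-likelihood in which the rate is taken to be the squared Euclidean distance between two embedding vectors. That choice of rate, the function names, and the `eps` floor are assumptions made for illustration only; the abstract does not specify them.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Exact edit distance by dynamic programming, O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def squared_euclidean(u, v) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def poisson_nll(rate: float, count: int, eps: float = 1e-8) -> float:
    """Negative log-likelihood of observing `count` under Poisson(rate).

    `eps` is a hypothetical floor that keeps log(rate) finite.
    """
    rate = max(rate, eps)
    return rate - count * math.log(rate) + math.lgamma(count + 1)


def embedding_loss(u, v, seq_a: str, seq_b: str) -> float:
    """Assumed training signal: Poisson NLL of the true edit distance,
    with the embedding distance acting as the Poisson rate."""
    return poisson_nll(squared_euclidean(u, v), levenshtein(seq_a, seq_b))
```

In this sketch the loss for a pair is smallest when the embedding distance equals the true edit distance, which is the property a Poisson regression objective would exploit during training. In a real model, `u` and `v` would come from a neural encoder and the loss would be averaged over sequence pairs.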