DNA strands serve as a storage medium for $4$-ary data over the alphabet $\{A,T,G,C\}$. DNA data storage promises formidable information density, long-term durability, and ease of replication. However, information stored in this medium can be corrupted. Experiments have revealed that DNA sequences with long homopolymers and/or with low $GC$-content are notably more prone to errors during storage. This paper investigates the use of the recently introduced method for designing lexicographically-ordered constrained (LOCO) codes in DNA data storage. We introduce DNA LOCO (D-LOCO) codes, which are defined over the alphabet $\{A,T,G,C\}$ and limit runs of identical symbols. These codes come with an encoding-decoding rule that we derive, which yields computationally affordable encoding-decoding algorithms. In terms of storage overhead, the proposed encoding-decoding algorithms outperform those in the existing literature. Our algorithms are also readily reconfigurable. D-LOCO codes are intrinsically balanced, which allows us to achieve balancing over the entire DNA strand with minimal rate penalty. Moreover, we propose four schemes to bridge consecutive codewords, three of which guarantee single substitution error detection per codeword. We examine the probability of undetected errors. We also show that D-LOCO codes are capacity-achieving and that they offer remarkably high rates at moderate lengths.
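To make the two constraints named above concrete, the following is a minimal Python sketch, not the D-LOCO encoding-decoding rule itself. It measures the longest homopolymer run and the $GC$-content of a strand, and counts the $4$-ary sequences of a given length whose runs do not exceed a chosen limit, using the standard run-length recurrence. The function names and the parameter `max_run` are illustrative assumptions and do not come from the paper.

```python
from itertools import groupby
import math


def max_run_length(strand: str) -> int:
    """Length of the longest run of identical symbols (0 for an empty strand)."""
    return max((len(list(group)) for _, group in groupby(strand)), default=0)


def gc_content(strand: str) -> float:
    """Fraction of symbols in the strand that are G or C."""
    if not strand:
        return 0.0
    return sum(ch in "GC" for ch in strand) / len(strand)


def count_run_limited(m: int, max_run: int, q: int = 4) -> int:
    """Count length-m q-ary sequences whose runs all have length <= max_run.

    Assumed model: only the run-length constraint is imposed; bridging and
    balancing, which the paper handles separately, are ignored here.
    """
    counts = [1]  # counts[n] = number of valid sequences of length n
    for n in range(1, m + 1):
        if n <= max_run:
            # Too short to violate the constraint: every sequence is valid.
            counts.append(q ** n)
        else:
            # The last run has length i in 1..max_run, and its symbol must
            # differ from the symbol that precedes it (q - 1 choices).
            counts.append((q - 1) * sum(counts[n - i] for i in range(1, max_run + 1)))
    return counts[m]


def rate_upper_bound(m: int, max_run: int, q: int = 4) -> float:
    """Bits per symbol if every run-limited sequence of length m were used."""
    return math.log2(count_run_limited(m, max_run, q)) / m


if __name__ == "__main__":
    strand = "AATGGGC"
    print(max_run_length(strand))   # 3, from the run GGG
    print(gc_content("ATGC"))       # 0.5
    print(count_run_limited(3, 2))  # 60, i.e. 64 minus AAA, TTT, GGG, CCC
    print(rate_upper_bound(20, 3))
```

As `m` grows, `rate_upper_bound` approaches the capacity of the run-length constraint alone, which is the benchmark against which a capacity-achieving code is measured; the unconstrained maximum is 2 bits per symbol.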