DNA is an attractive medium for digital data storage. Storing data in DNA introduces errors, which makes error-correcting codes critical for reliable DNA data storage. A common technique for reducing errors is to impose constraints that avoid homopolymers (consecutive repeated nucleotides) and balance the GC content, since sequences with homopolymers or unbalanced GC content are often associated with higher error rates. However, constrained coding comes at the cost of increased redundancy. An alternative is to randomize the sequences, accept the resulting errors, and pay for them with additional error-correction redundancy. In this paper, we determine the error regimes in which embracing substitutions is more efficient than constrained coding for DNA data storage. Our results suggest that constrained coding for substitution errors is inefficient for existing DNA data storage systems. Theoretical analysis indicates that constrained coding is efficient only if the increase in substitution errors for nucleotides in homopolymers and in sequences with unbalanced GC content is very large. Additionally, empirical results show that the increase in substitution, deletion, and insertion rates for these nucleotides is minimal in existing DNA storage systems.
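As a concrete illustration of the two constraints discussed above, the following minimal sketch checks whether a DNA sequence avoids long homopolymers and has balanced GC content. The function names and the thresholds (`max_run=3`, GC fraction between 0.4 and 0.6) are assumptions chosen for illustration; they are not taken from the paper, and real systems pick their own limits.

```python
from itertools import groupby


def max_homopolymer_run(seq: str) -> int:
    """Return the length of the longest run of identical consecutive nucleotides."""
    return max((sum(1 for _ in group) for _, group in groupby(seq)), default=0)


def gc_content(seq: str) -> float:
    """Return the fraction of nucleotides that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def satisfies_constraints(
    seq: str,
    max_run: int = 3,      # assumed homopolymer limit, illustrative only
    gc_low: float = 0.4,   # assumed lower GC bound, illustrative only
    gc_high: float = 0.6,  # assumed upper GC bound, illustrative only
) -> bool:
    """Check both constraints: bounded homopolymer runs and balanced GC content."""
    return (
        max_homopolymer_run(seq) <= max_run
        and gc_low <= gc_content(seq) <= gc_high
    )
```

A constrained code admits only sequences for which `satisfies_constraints` returns true, and the fraction of sequences excluded is the source of the redundancy cost the abstract refers to. The randomizing approach skips this check and relies on error correction instead.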