Motivated by DNA-based data storage systems, we investigate the errors that occur when DNA strands are synthesized in parallel, where the machine extends each strand one nucleotide at a time according to a template supersequence. If the machine fails at some cycle, the strands meant to be extended at that cycle are not extended, and we refer to this as a synthesis defect. In this paper, we present two families of codes correcting synthesis defects: t-known-synthesis-defect correcting codes and t-synthesis-defect correcting codes. For the first family, the defective cycles are assumed to be known, and each codeword is a quaternary sequence. We provide constructions for this family for t = 1, 2, with redundancy log 4 and log n + 18 log 3, respectively. For the second family, each codeword is a set of M ordered sequences, and we give constructions for t = 1, 2 to illustrate a strategy for constructing this family of codes. Finally, we derive a lower bound on the redundancy of single-known-synthesis-defect correcting codes, which shows that our construction is almost optimal.
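The synthesis model described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's formal model: the function names `schedule` and `synthesize`, the periodic template `ACGT...`, and the example strands are all hypothetical. It further assumes that each strand's appending cycles are fixed in advance by a greedy match against the template, so that a defective cycle simply deletes the nucleotide scheduled for that cycle from every affected strand.

```python
def schedule(strand, template):
    """Return the cycle index at which each nucleotide of `strand` is appended.

    Assumption: the machine matches the strand greedily against the template,
    appending a nucleotide at the first cycle that offers it.
    """
    cycles = []
    pos = 0
    for c, nt in enumerate(template):
        if pos < len(strand) and strand[pos] == nt:
            cycles.append(c)
            pos += 1
    if pos != len(strand):
        raise ValueError("template is not a supersequence of the strand")
    return cycles


def synthesize(strands, template, defective=frozenset()):
    """Simulate parallel synthesis; cycles listed in `defective` append nothing.

    Assumption: the schedule is not adjusted after a failure, so each strand
    loses exactly the nucleotides scheduled at defective cycles.
    """
    result = []
    for s in strands:
        sched = schedule(s, template)
        result.append("".join(nt for nt, c in zip(s, sched) if c not in defective))
    return result


template = "ACGT" * 4              # hypothetical periodic template supersequence
strands = ["ACGT", "GGAT", "TCA"]  # hypothetical strands to synthesize

print(synthesize(strands, template))                 # no defect
print(synthesize(strands, template, defective={2}))  # cycle 2 (a 'G' cycle) fails
```

In this toy run, a failure at cycle 2 removes one `G` from the first two strands and leaves the third untouched, since the third strand was not scheduled for that cycle. A single defective cycle can thus affect many strands at once, but only at the positions tied to that cycle.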