Ever since deoxyribonucleic acid (DNA) was first considered as a next-generation data-storage medium, considerable research effort has gone into correcting the errors that occur during synthesis, storage, and sequencing by using error-correcting codes (ECCs). Previous work on recovering data from an erroneous sequenced DNA pool has relied on hard decoding algorithms based on a majority decision rule. To improve the correction capability of ECCs and the robustness of the DNA storage system, we propose a new iterative soft decoding algorithm in which soft information is obtained from FASTQ files and channel statistics. In particular, we propose a new formula for log-likelihood ratio (LLR) calculation using quality scores (Q-scores), together with a redecoding method that may be suitable for error correction and detection in DNA sequencing. Building on the widely adopted fountain-code-based encoding scheme proposed by Erlich et al., we evaluate performance on three different sets of sequenced data to show consistency. The proposed soft decoding algorithm improves the reduction in the number of reads by 2.3% to 7.0% compared with the state-of-the-art decoding method, and it is shown to handle erroneous sequenced oligo reads containing insertion and deletion errors.
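To make the idea of deriving soft information from Q-scores concrete, the following is a minimal sketch and not the LLR formula proposed in this work, which also incorporates channel statistics. It relies only on the standard Phred relation P(error) = 10^(-Q/10) and the common Phred+33 FASTQ encoding. The two-bit base mapping, the assumption of substitution-only errors spread evenly over the three other bases, and the helper names (`phred_to_error_prob`, `base_bit_llrs`, `combine_reads`) are illustrative assumptions.

```python
import math
from typing import List, Tuple

# Assumed 2-bit mapping per nucleotide (illustrative, not taken from the paper).
BASE_TO_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def phred_to_error_prob(q_char: str, offset: int = 33) -> float:
    """Convert one FASTQ quality character (Phred+33 by default) to P(error)."""
    q = ord(q_char) - offset
    return 10.0 ** (-q / 10.0)


def base_bit_llrs(base: str, q_char: str) -> Tuple[float, float]:
    """Return per-bit LLRs, log(P(bit=0) / P(bit=1)), for one called base.

    Assumption: substitution errors only, with the error probability
    split evenly across the three other bases.
    """
    p_err = phred_to_error_prob(q_char)
    # Clamp so that extreme Q-scores never lead to log(0).
    p_err = min(max(p_err, 1e-12), 1.0 - 1e-12)
    probs = {b: (1.0 - p_err if b == base else p_err / 3.0) for b in BASE_TO_BITS}
    llrs = []
    for i in range(2):
        p0 = sum(p for b, p in probs.items() if BASE_TO_BITS[b][i] == 0)
        p1 = sum(p for b, p in probs.items() if BASE_TO_BITS[b][i] == 1)
        llrs.append(math.log(p0 / p1))
    return llrs[0], llrs[1]


def combine_reads(reads: List[Tuple[str, str]]) -> List[float]:
    """Sum per-bit LLRs over already aligned, equal-length reads of one oligo.

    Each read is a (sequence, quality_string) pair. Summation treats the
    reads as independent observations, which is itself an assumption.
    """
    n_bases = len(reads[0][0])
    total = [0.0] * (2 * n_bases)
    for seq, qual in reads:
        for j, (base, q_char) in enumerate(zip(seq, qual)):
            l0, l1 = base_bit_llrs(base, q_char)
            total[2 * j] += l0
            total[2 * j + 1] += l1
    return total
```

Compared with a majority decision rule, where every read counts equally, the summed LLRs let a low-quality base call contribute less than a high-quality one. The resulting values could then be passed to a soft-input ECC decoder. Insertion and deletion errors are not handled in this sketch.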