Although the cost of DNA sequencing has been decreasing rapidly, sequencing still costs roughly $120/GB, which is dramatically more expensive than reading from existing archival storage solutions. In this work, we aim to reduce both the cost and the latency of DNA storage by initiating the study of the DNA coverage depth problem, which asks how to reduce the number of reads required to retrieve information from the storage system. Under this framework, our main goal is to understand the effect of error-correcting codes and retrieval algorithms on the required sequencing coverage depth. We establish that the expected number of reads required for information retrieval is minimized when the channel follows a uniform distribution. We also derive upper and lower bounds on the probability distribution of the number of required reads, as well as upper and lower bounds on its expected value. We further prove that, for a noiseless channel and a uniform distribution, MDS codes are optimal in the sense that they minimize the expected number of reads. Additionally, we study the DNA coverage depth problem under the random-access setup, in which the user aims to retrieve only a specific information unit from the entire DNA storage system. We prove that the expected retrieval time is at least k for [n,k] MDS codes as well as for other families of codes. Furthermore, we present explicit code constructions that achieve expected retrieval times below k, and we evaluate their performance through analysis and simulations. Lastly, we provide lower bounds on the maximum expected retrieval time. Our findings offer valuable insights for reducing the cost and latency of DNA storage.
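To make the noiseless, uniform setting concrete, the following is a minimal sketch and not the method of this work. It assumes that each read returns one of the n encoded strands uniformly at random, with replacement, and that an [n,k] MDS code can be decoded from any k distinct strands. Under these assumptions, the number of reads behaves like a partial coupon-collector process with expectation n/n + n/(n-1) + ... + n/(n-k+1). The function names `expected_reads_mds` and `simulate_reads_mds` are hypothetical and introduced only for illustration.

```python
import random
from fractions import Fraction


def expected_reads_mds(n: int, k: int) -> Fraction:
    """Expected number of uniform reads until k distinct strands out of n are seen.

    Assumption (not from the source): noiseless reads, uniform sampling with
    replacement, and decoding succeeds once any k distinct strands are read.
    """
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("require n > 0 and 0 <= k <= n")
    # After i distinct strands are seen, a new one appears with probability
    # (n - i) / n, so the waiting time for the next one has mean n / (n - i).
    return sum((Fraction(n, n - i) for i in range(k)), Fraction(0))


def simulate_reads_mds(n: int, k: int, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, for a sanity check."""
    rng = random.Random(seed)
    total_reads = 0
    for _ in range(trials):
        seen = set()
        reads = 0
        while len(seen) < k:
            seen.add(rng.randrange(n))  # one uniform read
            reads += 1
        total_reads += reads
    return total_reads / trials


if __name__ == "__main__":
    n, k = 10, 5
    print("closed form:", float(expected_reads_mds(n, k)))
    print("simulation: ", simulate_reads_mds(n, k))
```

With k = n (no redundancy), the closed form reduces to the classical coupon-collector value n(1 + 1/2 + ... + 1/n). Comparing it with smaller k at the same n shows how redundancy reduces the expected number of reads in this simplified model.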