A Combinatorial Perspective on Random Access Efficiency for DNA Storage

We investigate the fundamental limits of the recently proposed random access coverage depth problem for DNA data storage. Under this paradigm, it is assumed that the user information consists of $k$ information strands, which are encoded into $n$ strands via a generator matrix $G$. During the sequencing process, the strands are read uniformly at random, as each strand is available in a large number of copies. In this context, the random access coverage depth problem asks for the expected number of reads (i.e., sequenced strands) required to decode a specific information strand requested by the user. This expectation depends heavily on the generator matrix $G$, and besides computing it for different choices of $G$, the goal is to construct matrices that minimize the maximum expectation over all possible requested information strands, denoted by $T_{\max}(G)$. In this paper, we introduce new techniques to investigate the random access coverage depth problem, capturing its combinatorial nature and identifying the structural properties of generator matrices that are advantageous. We establish two general formulas to determine $T_{\max}(G)$ for arbitrary generator matrices. The first formula depends on the linear dependencies between columns of $G$, whereas the second formula takes into account recovery sets and their intersection structure. We also introduce the concept of recovery balanced codes and provide three sufficient conditions for a code to be recovery balanced. These conditions can be used to compute $T_{\max}(G)$ for various families of codes, such as MDS, simplex, Hamming, and binary Reed-Muller codes. Additionally, we study the performance of modified systematic MDS and simplex matrices, showing that the best results for $T_{\max}(G)$ are achieved with a specific combination of encoded strands and replication of the information strands.
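To make the quantity $T_{\max}(G)$ concrete, the following minimal sketch computes it exactly for small binary generator matrices by brute force. It is an illustration under stated assumptions, not a reproduction of either formula established in the paper: it restricts to the binary field, represents each column of $G$ as an integer bitmask, and uses a standard coupon-collector argument (the wait to go from $s$ to $s+1$ distinct strands has mean $n/(n-s)$, weighted by the fraction of $s$-subsets of columns that do not yet recover the requested strand). The names `in_span`, `expected_reads`, and `t_max` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def in_span(vectors, target):
    """Return True if `target` lies in the GF(2) span of `vectors` (int bitmasks)."""
    basis = []
    for v in vectors:
        # min(v, v ^ b) clears the leading bit of b in v whenever it is set.
        for b in basis:
            v = min(v, v ^ b)
        if v:
            basis.append(v)
    for b in basis:
        target = min(target, target ^ b)
    return target == 0


def expected_reads(columns, i):
    """Exact expected number of uniform reads (with replacement) needed to
    recover information strand i, where `columns` are the columns of a
    binary generator matrix G given as bitmasks (bit i = coordinate i)."""
    n = len(columns)
    target = 1 << i  # unit vector e_i
    if not in_span(columns, target):
        raise ValueError("strand i cannot be recovered from these columns")
    total = Fraction(0)
    for s in range(n):
        # Count s-subsets of column positions that do NOT recover e_i.
        failing = sum(
            1 for subset in combinations(columns, s)
            if not in_span(subset, target)
        )
        # Mean wait to see a new distinct strand when s are already seen.
        total += Fraction(failing, comb(n, s)) * Fraction(n, n - s)
    return total


def t_max(columns, k):
    """Maximum expected number of reads over all k information strands."""
    return max(expected_reads(columns, i) for i in range(k))


if __name__ == "__main__":
    # k = 2: two systematic columns e_0, e_1 plus one parity column e_0 + e_1.
    parity_code = [0b01, 0b10, 0b11]
    # k = 2: strand 0 stored twice, strand 1 stored once (pure replication).
    replicated = [0b01, 0b01, 0b10]
    print("parity code  T_max =", t_max(parity_code, 2))
    print("replication  T_max =", t_max(replicated, 2))
```

The enumeration is exponential in $n$, so this is only meant for checking small examples by hand. In the two toy matrices above, both with $n = 3$ and $k = 2$, the parity column gives a smaller maximum expectation than unbalanced replication, which illustrates why the choice of $G$ matters.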
