Coded Information Retrieval for Block-Structured DNA-Based Data Storage

We study the problem of coded information retrieval for block-structured data, motivated by DNA-based storage systems where a database is partitioned into multiple files that must each be recoverable as an atomic unit. We initiate and formalize the block-structured retrieval problem, wherein $k$ information symbols are partitioned into two files $F_1$ and $F_2$ of sizes $s_1$ and $s_2 = k - s_1$. The objective is to characterize the set of achievable expected retrieval time pairs $\bigl(E_1(G), E_2(G)\bigr)$ over all $[n,k]$ linear codes with generator matrix $G$. We derive a family of linear lower bounds via mutual exclusivity of recovery sets, and develop a nonlinear geometric bound via column projection. For codes with no mixed columns, this yields the hyperbolic constraint $s_1/E_1 + s_2/E_2 \le 1$, which we conjecture to hold universally whenever $\max\{s_1,s_2\} \ge 2$. We analyze explicit codes, including the identity code, file-dedicated MDS codes, and the systematic global MDS code, and compute their exact expected retrieval times. For file-dedicated codes we prove MDS optimality within the family and verify the hyperbolic constraint. For global MDS codes, we establish dominance by the proportional local MDS allocation via a combinatorial subset-counting argument, which gives a significantly simpler proof than those in the recent literature and formally extends the result to the asymmetric case. Finally, we characterize the limiting achievability region as $n \to \infty$: the hyperbolic boundary is asymptotically achieved by file-dedicated MDS codes, and is conjectured to be the exact boundary of the limiting achievability region.
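As a rough numerical illustration of the hyperbolic constraint for file-dedicated MDS codes, the sketch below evaluates $s_1/E_1 + s_2/E_2$ exactly. It rests on assumptions that the abstract does not state: the $n$ encoded columns are sampled uniformly at random with replacement, the retrieval time of a file is the number of draws until it can be decoded, and a file-dedicated code assigns $n_i$ columns forming an $[n_i, s_i]$ MDS code to file $F_i$, with $n_1 + n_2 = n$. Under these assumptions a coupon-collector argument gives $E_i = \sum_{j=0}^{s_i-1} n/(n_i - j)$. The function names are hypothetical.

```python
from fractions import Fraction


def dedicated_mds_time(n: int, n_i: int, s_i: int) -> Fraction:
    """Expected number of uniform draws (with replacement, over n columns)
    until s_i distinct columns from a dedicated block of n_i columns are seen.

    Assumed model, not taken from the abstract: after j distinct block
    columns have been seen, a draw hits a new one with probability
    (n_i - j) / n, so the expected wait for it is n / (n_i - j).
    """
    if not (0 < s_i <= n_i <= n):
        raise ValueError("need 0 < s_i <= n_i <= n")
    return sum((Fraction(n, n_i - j) for j in range(s_i)), Fraction(0))


def hyperbolic_lhs(s1: int, s2: int, e1: Fraction, e2: Fraction) -> Fraction:
    """Left-hand side s1/E1 + s2/E2 of the hyperbolic constraint."""
    return Fraction(s1) / e1 + Fraction(s2) / e2


if __name__ == "__main__":
    s1, s2 = 2, 3
    # (n1, n2) = (2, 3) is the identity code; larger blocks keep the 2:3 ratio.
    for n1, n2 in [(2, 3), (4, 6), (400, 600)]:
        n = n1 + n2
        e1 = dedicated_mds_time(n, n1, s1)
        e2 = dedicated_mds_time(n, n2, s2)
        lhs = hyperbolic_lhs(s1, s2, e1, e2)
        print(f"n={n}: E1={float(e1):.4f}, E2={float(e2):.4f}, lhs={float(lhs):.4f}")
```

In this sketch each term satisfies $s_i/E_i \le n_i/n$, because every summand $n/(n_i - j)$ is at least $n/n_i$, so the left-hand side never exceeds 1. With the ratio $n_1 : n_2$ held fixed it moves toward 1 as $n$ grows, which is in line with the asymptotic achievability claim.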
