The coverage depth problem in DNA data storage asks for the expected number of reads needed to recover all encoded strands. Given a generator matrix of a linear code, this quantity equals the expected number of randomly drawn columns required to obtain full rank. While MDS codes are optimal when they exist, i.e., over large fields, practical scenarios may rely on structured code families defined over small fields. In this work, we develop combinatorial tools to solve the DNA coverage depth problem for various linear codes, based on duality arguments and the notion of the extended weight enumerator. Using these methods, we derive closed formulas for the simplex, Hamming, ternary Golay, extended ternary Golay, and first-order Reed-Muller codes. The centerpiece of this paper is a general expression for the coverage depth of a linear code in terms of the weight distributions of its higher-field extensions.
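The rank formulation above can be checked by brute force on very small binary codes. The following is a minimal sketch, not the method of the paper: it assumes that columns are drawn uniformly at random with replacement, restricts to the field F_2, and computes the expectation exactly by recursing over the subspace spanned so far. The function names `span_add` and `expected_draws`, and the encoding of columns as integer bitmasks, are illustrative choices made here.

```python
from fractions import Fraction
from functools import lru_cache


def span_add(subspace, v):
    """Return the F_2-span of `subspace` together with the vector `v`."""
    return subspace | frozenset(x ^ v for x in subspace)


def expected_draws(columns, k):
    """Exact expected number of uniform draws (with replacement, an assumption)
    from `columns` until the drawn columns span F_2^k.

    `columns` lists the generator-matrix columns as integers in [0, 2^k).
    """
    n = len(columns)
    full_size = 1 << k  # number of vectors in F_2^k

    @lru_cache(maxsize=None)
    def expect(subspace):
        if len(subspace) == full_size:
            return Fraction(0)  # full rank already reached
        # Columns that would increase the rank if drawn next.
        outside = [c for c in columns if c not in subspace]
        if not outside:
            raise ValueError("columns do not span F_2^k")
        # One-step recurrence: E[S] = 1 + (m/n) E[S] + (1/n) * sum E[S + c],
        # where m = n - len(outside) columns leave the span unchanged.
        total = sum((expect(span_add(subspace, c)) for c in outside), Fraction(0))
        return (n + total) / len(outside)

    return expect(frozenset({0}))


# Identity generator matrix with k = 3: the classical coupon-collector value.
print(expected_draws([1, 2, 4], 3))          # 11/2
# Binary simplex code of dimension 3: the columns are all nonzero vectors.
print(expected_draws(list(range(1, 8)), 3))  # 47/12
```

The recursion visits subspaces rather than column subsets, so it is only practical for small dimension; it is meant as a sanity check against closed formulas, not as a replacement for them.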