Bounds and Constructions of Codes for Ordered Composite DNA Sequences

This paper extends the foundational work of Dollma \emph{et al.} on codes for ordered composite DNA sequences. We consider the general setting with an alphabet of size $q$ and a resolution parameter $k$, moving beyond the binary case ($q=2$) that was the main focus of earlier work. We investigate error-correcting codes for substitution errors and deletion errors under several channel models, including $(e_1,\ldots,e_k)$-composite error/deletion, $e$-composite error/deletion, and the newly introduced $t$-$(e_1,\ldots,e_t)$-composite error/deletion model. We first establish equivalence relations among families of composite-error correcting codes (CECCs) and among families of composite-deletion correcting codes (CDCCs). This significantly reduces the number of distinct error-parameter sets that require separate analysis. We then derive new, general upper bounds on the sizes of CECCs using refined sphere-packing arguments and probabilistic methods. Together, these bounds cover all values of the parameters $q$, $k$, $(e_1,\ldots,e_k)$, and $e$. In contrast, previous bounds were established only for $q=2$ and limited choices of $k$, $(e_1,\ldots,e_k)$, and $e$. For CDCCs, we generalize a known non-asymptotic upper bound for $(1,0,\ldots,0)$-CDCCs and then provide a cleaner asymptotic bound. On the constructive side, for any $q\ge2$, we propose $(1,0,\ldots,0)$-CDCCs, $1$-CDCCs, and $t$-$(1,\ldots,1)$-CDCCs with near-optimal redundancies. These codes have efficient and systematic encoders. For substitution errors, we design the first explicit encoding and decoding algorithms for the binary $(1,0,\ldots,0)$-CECC constructed by Dollma \emph{et al.}, and extend the approach to general $q$. Furthermore, we give an improved construction of binary $1$-CECCs, a construction of nonbinary $1$-CECCs, and a construction of $t$-$(1,\ldots,1)$-CECCs. These constructions are also systematic.
