We consider the reconstruction of a codeword from multiple noisy copies, each independently corrupted by insertions, deletions, and substitutions. This problem arises, for example, in DNA data storage. A common construction uses a concatenated coding scheme that combines an outer linear block code with an inner code, which can be either a nonlinear marker code or a convolutional code. Outer decoding is performed with belief propagation, and inner decoding with the Bahl-Cocke-Jelinek-Raviv (BCJR) algorithm. However, the complexity of the BCJR algorithm scales exponentially with the number of noisy copies, making it infeasible to reconstruct a codeword from more than about four copies. In this work, we introduce BCJRFormer, a transformer-based neural inner decoder. For binary and quaternary single-message transmissions of marker codes, BCJRFormer achieves error rates comparable to those of the BCJR algorithm. Importantly, BCJRFormer scales only quadratically with the number of noisy copies, which makes it well suited to DNA data storage, where multiple reads of the same DNA strand are common. To further lower error rates, we replace the belief propagation outer decoder with a transformer-based decoder. Together, these modifications yield an efficient and performant end-to-end transformer-based pipeline for decoding multiple noisy copies affected by insertion, deletion, and substitution errors. Additionally, we propose ConvBCJRFormer, a novel cross-attending transformer architecture that extends BCJRFormer to decode transmissions of convolutional codewords, serving as an initial step toward joint inner and outer decoding for more general linear code classes.