The general trace reconstruction problem seeks to recover an original sequence from its noisy copies independently corrupted by insertions, deletions, and substitutions. This problem arises in applications such as DNA data storage, a promising storage medium due to its high information density and longevity. However, errors introduced during DNA synthesis, storage, and sequencing require correction through algorithms and codes, with trace reconstruction often used as part of data retrieval. In this work, we propose TReconLM, a decoder-only transformer that solves trace reconstruction as a next-token prediction task. TReconLM outperforms state-of-the-art trace reconstruction algorithms, including prior deep-learning approaches, recovering a substantially higher fraction of sequences without error. We pretrain on synthetic data generated from a simple error model and fine-tune on real-world data to adapt to technology-specific error patterns. Code is available at https://github.com/MLI-lab/TReconLM.
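To make the problem setup concrete, the following is a minimal sketch of how synthetic training pairs could be generated from a simple insertion/deletion/substitution (IDS) error model. It is an illustration under stated assumptions, not the TReconLM data pipeline: the function names `corrupt`, `make_example`, and `to_prompt`, the per-position error model, the error rates, and the `|` / `:` serialization are all hypothetical choices made for this example.

```python
import random

ALPHABET = "ACGT"  # DNA bases


def corrupt(seq, p_ins, p_del, p_sub, rng):
    """Pass `seq` through a simple IDS channel (assumed model).

    At each position exactly one event is drawn independently:
    insertion, deletion, substitution, or error-free transmission.
    """
    out = []
    for base in seq:
        r = rng.random()
        if r < p_ins:
            # Insertion: emit a random base, then the original base.
            out.append(rng.choice(ALPHABET))
            out.append(base)
        elif r < p_ins + p_del:
            # Deletion: drop the base.
            continue
        elif r < p_ins + p_del + p_sub:
            # Substitution: replace with a different base.
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_example(length, num_traces, p_ins, p_del, p_sub, seed=0):
    """Sample a random target sequence and `num_traces` noisy copies of it."""
    rng = random.Random(seed)
    target = "".join(rng.choice(ALPHABET) for _ in range(length))
    traces = [corrupt(target, p_ins, p_del, p_sub, rng) for _ in range(num_traces)]
    return target, traces


def to_prompt(traces, target):
    """Serialize one example for next-token prediction.

    Hypothetical format: traces joined by '|', then ':', then the target.
    A decoder-only model would be trained to continue the text after ':'.
    """
    return "|".join(traces) + ":" + target


if __name__ == "__main__":
    target, traces = make_example(length=30, num_traces=4,
                                  p_ins=0.03, p_del=0.03, p_sub=0.03, seed=42)
    print(to_prompt(traces, target))
```

Framing the task this way means that reconstruction reduces to autoregressive generation: given the serialized traces as a prefix, the model predicts the original sequence one token at a time. Fine-tuning on real data would replace the synthetic `corrupt` channel with traces observed from an actual synthesis and sequencing pipeline.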