DNA-based storage has emerged as a promising alternative to traditional data storage, offering advantages in data density, longevity, and sustainability. Two main approaches have developed: in-vitro storage, where information is synthesized in controlled environments, and in-vivo storage, where data is embedded within an organism's DNA for enhanced confidentiality and protection. In-vivo storage, however, is subject to mutations, including duplications, deletions, and substitutions, which continuously alter the stored sequence, increasing its length and changing its composition over time and making error correction particularly challenging. We study the asymptotic behavior of mutation systems, which model the probabilistic evolution of sequences over a finite alphabet and are central to the analysis of in-vivo DNA-based data storage. Building upon prior works that established the limits of the empirical $k$-tuple frequencies, we characterize the stochastic fluctuations around these limits by establishing a Central Limit Theorem (CLT). Our approach uses the spectral properties of the $k$-substitution matrix to project the centered count vectors, which allows us to approximate the system by a martingale difference sequence and then verify the conditions of the classical martingale CLT. In addition, we explicitly derive the limiting covariance matrix.
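To make the object of study concrete, the following is a minimal simulation sketch of a mutation system and its empirical $k$-tuple frequencies. It is not the model analyzed in the paper: the function names (`mutate`, `ktuple_frequencies`, `simulate`), the choice of single-symbol tandem duplication and uniform substitution as the only mutation types, the parameter `p_dup`, and the cyclic reading of the sequence are all assumptions made for illustration.

```python
import random
from collections import Counter


def mutate(seq, alphabet, p_dup, rng):
    """Apply one random mutation to seq in place (hypothetical rule set).

    With probability p_dup, duplicate the symbol at a uniformly chosen
    position (length grows by one); otherwise overwrite that position with
    a uniformly chosen symbol, which may equal the current one.
    """
    i = rng.randrange(len(seq))
    if rng.random() < p_dup:
        seq.insert(i, seq[i])          # tandem duplication of one symbol
    else:
        seq[i] = rng.choice(alphabet)  # substitution


def ktuple_frequencies(seq, k):
    """Return empirical k-tuple frequencies, reading seq cyclically (assumed)."""
    n = len(seq)
    counts = Counter(
        tuple(seq[(i + j) % n] for j in range(k)) for i in range(n)
    )
    return {t: c / n for t, c in counts.items()}


def simulate(steps, k=2, p_dup=0.5, alphabet="ACGT", seed=0):
    """Run the assumed mutation process and report k-tuple frequencies."""
    rng = random.Random(seed)
    seq = list(alphabet)  # arbitrary initial sequence, an assumption
    for _ in range(steps):
        mutate(seq, alphabet, p_dup, rng)
    return seq, ktuple_frequencies(seq, k)


if __name__ == "__main__":
    seq, freqs = simulate(steps=10_000, k=2)
    print(len(seq))
    for kmer, f in sorted(freqs.items()):
        print("".join(kmer), round(f, 4))
```

Running `simulate` over many seeds and comparing the resulting frequency vectors gives an informal view of the fluctuations that the CLT describes; the sketch itself makes no claim about their scaling or covariance.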