We study a data-driven approach to the bee-identification problem for DNA strands. The bee-identification problem, introduced by Tandon et al. (2019), requires one to identify $M$ bees, each tagged by a unique barcode, via a set of $M$ noisy measurements. Later, Chrisnata et al. (2022) extended the model to the case where one observes $N$ noisy measurements of each bee, and applied the model to address the unordered nature of DNA storage systems. In such systems, a unique address is typically prepended to each DNA data block to form a DNA strand, but the address may be corrupted. While clustering is usually used to identify the address of a DNA strand, this requires $\mathcal{M}^2$ data comparisons (where $\mathcal{M}$ is the number of reads). In contrast, the approach of Chrisnata et al. (2022) avoids data comparisons completely. In this work, we study an intermediate, data-driven approach to this identification task. For the binary erasure channel, we first show that we can almost surely correctly identify all DNA strands under certain mild assumptions. Then we propose a data-driven pruning procedure and demonstrate that on average the procedure uses only a fraction of $\mathcal{M}^2$ data comparisons. Specifically, for $\mathcal{M}= 2^n$ and erasure probability $p$, the expected number of data comparisons performed by the procedure is $\kappa\mathcal{M}^2$, where $\left(\frac{1+2p-p^2}{2}\right)^n \leq \kappa \leq \left(\frac{1+p}{2}\right)^n $.
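To get a feel for the size of the fraction $\kappa$, the following minimal sketch evaluates the two expressions that bound it for a chosen $n$ and $p$. The function name `kappa_expressions` and the sample values $n = 16$, $p = 0.1$ are illustrative assumptions, not taken from the paper. Since $\frac{1+p}{2} \leq \frac{1+2p-p^2}{2}$ for every $p \in [0, 1]$, the sketch reports each expression by name and does not assume which of the two is the smaller.

```python
def kappa_expressions(n: int, p: float) -> tuple[float, float]:
    """Evaluate the two expressions that bound kappa in the abstract.

    Returns (quadratic, linear), where
        quadratic = ((1 + 2p - p^2) / 2) ** n
        linear    = ((1 + p) / 2) ** n
    The name and return order are illustrative choices, not the paper's.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    quadratic = ((1 + 2 * p - p * p) / 2) ** n
    linear = ((1 + p) / 2) ** n
    return quadratic, linear


if __name__ == "__main__":
    n, p = 16, 0.1            # hypothetical parameters
    reads = 2 ** n            # number of reads, M = 2^n
    quadratic, linear = kappa_expressions(n, p)
    print(f"all-pairs comparisons M^2 : {reads ** 2}")
    print(f"((1+2p-p^2)/2)^n * M^2    : {quadratic * reads ** 2:.3e}")
    print(f"((1+p)/2)^n * M^2         : {linear * reads ** 2:.3e}")
```

Both expressions decay exponentially in $n$ whenever $p < 1$, which is what makes the expected number of comparisons a vanishing fraction of $\mathcal{M}^2$ as the address length grows.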