We investigate the capacity of noisy frequency-based channels, motivated by DNA data storage in the short-molecule regime, where information is encoded in the frequencies of item types rather than in their order. The channel output is a histogram formed by random sampling of items, followed by noisy item identification. While the capacity of the noiseless frequency-based channel has been addressed previously, the effect of identification noise has not been fully characterized. We present a converse bound on the channel capacity that follows from stochastic degradation and the data processing inequality. We then establish an achievability bound, which is based on a Poissonization of the multinomial sampling process and an analysis of the resulting vector Poisson channel with inter-symbol interference. This analysis refines concentration inequalities for the information density used in the Feinstein bound, and explicitly characterizes an additive loss in the mutual information due to identification noise. We apply our results to a DNA storage channel in the short-molecule regime, and quantify the resulting loss in the scaling of the total number of reliably stored bits.
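The channel described above can be illustrated with a minimal simulation sketch. The abstract does not specify the noise model, so the code assumes that identification noise acts independently on each sampled item through a row-stochastic confusion matrix `W`; the names `p`, `W`, `n`, `noisy_histogram`, `poisson`, and `poissonized_histogram` are hypothetical and do not come from the paper. The Poissonized variant relies on the standard fact that, when the total number of samples is Poisson with mean `n`, the observed counts are independent Poisson variables with rates `n * sum_i p[i] * W[i][j]`, which is where the inter-symbol interference appears.

```python
import math
import random


def noisy_histogram(p, W, n, rng):
    """Draw n items i.i.d. from the frequency vector p, pass each item
    through the assumed identification channel W (row i is the distribution
    of the observed type given true type i), and return the histogram of
    observed types."""
    k = len(p)
    counts = [0] * k
    true_types = rng.choices(range(k), weights=p, k=n)
    for i in true_types:
        j = rng.choices(range(k), weights=W[i], k=1)[0]
        counts[j] += 1
    return counts


def poisson(lam, rng):
    """Knuth's multiplication method for Poisson sampling.
    Adequate for modest lam only (exp(-lam) underflows for very large lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def poissonized_histogram(p, W, n, rng):
    """Poissonized channel: output count j is an independent Poisson variable
    with rate n * sum_i p[i] * W[i][j], so each output mixes several inputs."""
    k = len(p)
    rates = [n * sum(p[i] * W[i][j] for i in range(k)) for j in range(k)]
    return [poisson(r, rng) for r in rates]


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.5, 0.3, 0.2]                      # hypothetical input frequencies
    W = [[0.90, 0.05, 0.05],                 # hypothetical confusion matrix
         [0.05, 0.90, 0.05],
         [0.05, 0.05, 0.90]]
    print(noisy_histogram(p, W, 200, rng))        # counts sum to exactly 200
    print(poissonized_histogram(p, W, 200, rng))  # total count is random
```

In the multinomial version the counts always sum to `n`, which couples the histogram entries; the Poissonized version removes that coupling at the cost of a random total, which is the usual reason for analyzing the Poisson model first.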