Motivated by the operation of decoders for DNA storage, we consider the problem of unsupervised clustering of noisy short sequences, each generated from one of multiple possible unknown source sequences after passing through a memoryless channel. Focusing on the statistically optimal clustering rule, we derive upper and lower bounds on the probability of incorrect clustering as a function of the sequence length, the number of reads, and the channel statistics.
翻译:暂无翻译