We revisit the problem of statistical sequence matching between two databases of sequences, initiated by Unnikrishnan (TIT 2015), and derive theoretical performance guarantees for the generalized likelihood ratio test (GLRT). We first consider the case where the number of matched pairs of sequences between the databases is known. In this case, the task is to accurately find the matched pairs among all possible matches between the sequences in the two databases. We analyze the performance of Unnikrishnan's GLRT and explicitly characterize the tradeoff between the mismatch and false reject probabilities under each hypothesis in both the large and small deviations regimes. Furthermore, we demonstrate the optimality of Unnikrishnan's GLRT under the generalized Neyman-Pearson criterion for both regimes and illustrate our theoretical results via numerical examples. Subsequently, we generalize our achievability analyses to the case where the number of matched pairs is unknown and an additional error probability needs to be considered. When one of the two databases contains a single sequence, the problem of statistical sequence matching specializes to the problem of multiple classification introduced by Gutman (TIT 1989). For this special case, our result for the small deviations regime strengthens the previous result of Zhou, Tan and Motani (Information and Inference 2020) by removing unnecessary conditions on the generating distributions.