Given a set of strings over a specified alphabet, identifying a median or consensus string that minimizes the total distance to all input strings is a fundamental data aggregation problem. When the Hamming distance is considered as the underlying metric, this problem has extensive applications, ranging from bioinformatics to pattern recognition. However, modern applications often require the generation of multiple (near-)optimal yet diverse median strings to enhance flexibility and robustness in decision-making. In this study, we address this need by focusing on two prominent diversity measures: sum dispersion and min dispersion. We first introduce an exact algorithm for the diameter variant of the problem, which identifies pairs of near-optimal medians that are maximally diverse. Subsequently, we propose a $(1-ε)$-approximation algorithm (for any $ε>0$) for sum dispersion, as well as a bi-criteria approximation algorithm for the more challenging min dispersion case, allowing the generation of multiple (more than two) diverse near-optimal Hamming medians. Our approach primarily leverages structural insights into the Hamming median space and also draws on techniques from error-correcting code construction to establish these results.
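To make the objects named above concrete, the following is a minimal sketch of the underlying definitions only; it does not implement the exact or approximation algorithms described in the abstract. It relies on the standard fact that total Hamming distance decomposes position by position, so a plurality vote in each column yields a median. All function names (`hamming`, `total_distance`, `hamming_median`, `sum_dispersion`, `min_dispersion`) are illustrative choices rather than names taken from the paper, and the input strings are assumed to have equal length.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def total_distance(candidate, strings):
    """Objective value: sum of Hamming distances from candidate to every input."""
    return sum(hamming(candidate, s) for s in strings)


def hamming_median(strings):
    """One optimal median: the most frequent symbol in each column.

    Ties are resolved by whichever symbol Counter reports first; any
    plurality symbol gives the same total distance.
    """
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*strings))


def sum_dispersion(candidates):
    """Sum of pairwise Hamming distances within a candidate set."""
    return sum(hamming(a, b) for a, b in combinations(candidates, 2))


def min_dispersion(candidates):
    """Smallest pairwise Hamming distance (needs at least two candidates)."""
    return min(hamming(a, b) for a, b in combinations(candidates, 2))


if __name__ == "__main__":
    strings = ["ACGT", "ACGA", "TCGT", "AGGT"]
    median = hamming_median(strings)
    print(median, total_distance(median, strings))

    # A hypothetical candidate set, chosen by hand for illustration.
    candidates = ["ACGT", "ACGA", "TCGT"]
    print([total_distance(c, strings) for c in candidates])
    print(sum_dispersion(candidates), min_dispersion(candidates))
```

In this toy instance the column-wise median is `ACGT`, with total distance 3, while the other two candidates each have total distance 5. Trading a small loss in the objective for larger pairwise distances among the returned strings is the tension that the two dispersion measures quantify.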