We consider the problem of finding the matching map between two sets of $d$-dimensional noisy feature vectors. The distinctive feature of our setting is that we do not assume that every vector of the first set has a corresponding vector in the second set. If $n$ and $m$ are the sizes of these two sets, we assume that the matching map to be recovered is defined on a subset of unknown cardinality $k^*\le \min(n,m)$. We show that, in the high-dimensional setting, if the signal-to-noise ratio is larger than $5(d\log(4nm/\alpha))^{1/4}$, then the true matching map can be recovered with probability $1-\alpha$. Interestingly, this threshold does not depend on $k^*$ and is the same as the one obtained in prior work in the case of $k^* = \min(n,m)$. The procedure for which this property is proved selects, in a data-driven way, among the candidate mappings $\{\hat\pi_k:k\in[\min(n,m)]\}$. Each $\hat\pi_k$ minimizes the sum of squared distances between two sets of size $k$. The resulting optimization problem can be formulated as a minimum-cost flow problem, and thus solved efficiently. Finally, we report the results of numerical experiments on both synthetic and real-world data that illustrate our theoretical results and provide further insight into the properties of the algorithms studied in this work.
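To make the two computational ingredients above concrete, here is a minimal Python sketch, not the authors' implementation. The function names `snr_threshold` and `k_matching` are hypothetical. The first function simply evaluates the stated threshold. The second computes a candidate $\hat\pi_k$, but instead of a minimum-cost flow solver it swaps in a standard padding reduction to a square assignment problem solved with SciPy's `linear_sum_assignment`; this is an assumption made for brevity. The data-driven choice of $k$ is not specified in the abstract and is deliberately left out.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def snr_threshold(d, n, m, alpha):
    """Evaluate 5 * (d * log(4nm / alpha))**(1/4), the threshold stated above."""
    return 5.0 * (d * np.log(4.0 * n * m / alpha)) ** 0.25


def k_matching(X, Y, k):
    """Return k pairs (i, j) minimizing sum ||X[i] - Y[j]||^2, and that sum.

    Illustrative reduction (an assumption, not the paper's flow formulation):
    pad the n x m cost matrix to a square of side n + m - k with dummy nodes.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, m = len(X), len(Y)
    if not 1 <= k <= min(n, m):
        raise ValueError("k must satisfy 1 <= k <= min(n, m)")

    # Squared Euclidean distance between every X[i] and every Y[j].
    cost = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)

    size = n + m - k
    big = cost.sum() + 1.0  # larger than the cost of any matching of real pairs
    padded = np.zeros((size, size))
    padded[:n, :m] = cost
    # Real-to-dummy pairs cost 0: the m - k dummy rows absorb unmatched columns
    # and the n - k dummy columns absorb unmatched rows.
    padded[n:, m:] = big  # penalize dummy-to-dummy pairs so they are never used

    rows, cols = linear_sum_assignment(padded)
    pairs = [(int(i), int(j)) for i, j in zip(rows, cols) if i < n and j < m]
    total = float(sum(cost[i, j] for i, j in pairs))
    return pairs, total


if __name__ == "__main__":
    # Synthetic toy data (hypothetical): the first 3 rows of X and Y share
    # the same underlying vectors, the remaining rows are unrelated.
    rng = np.random.default_rng(0)
    d, n, m, k_star = 20, 5, 7, 3
    theta = 10.0 * rng.normal(size=(k_star, d))
    X = np.vstack([theta, 10.0 * rng.normal(size=(n - k_star, d))])
    Y = np.vstack([theta, 10.0 * rng.normal(size=(m - k_star, d))])
    X = X + rng.normal(size=X.shape)
    Y = Y + rng.normal(size=Y.shape)
    for k in range(1, min(n, m) + 1):
        print(k, k_matching(X, Y, k))
    print("threshold:", snr_threshold(d, n, m, alpha=0.05))
```

Because every dummy-to-dummy entry costs more than all real entries combined, an optimal assignment of the padded matrix leaves exactly $k$ real rows matched to real columns, which is why the reduction yields a minimum-cost matching of size $k$. A dedicated minimum-cost flow solver would be the more direct counterpart of the formulation mentioned in the text.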