We give new data-dependent locality-sensitive hashing (LSH) schemes for the Earth Mover's Distance ($\mathsf{EMD}$), and as a result, improve the best approximation for nearest neighbor search under $\mathsf{EMD}$ by a quadratic factor. Here, the metric $\mathsf{EMD}_s(\mathbb{R}^d,\ell_p)$ consists of sets of $s$ vectors in $\mathbb{R}^d$, and for any two sets $x,y$ of $s$ vectors, the distance $\mathsf{EMD}(x,y)$ is the minimum cost of a perfect matching between $x$ and $y$, where the cost of matching two vectors is their $\ell_p$ distance. Previously, Andoni, Indyk, and Krauthgamer gave a (data-independent) locality-sensitive hashing scheme for $\mathsf{EMD}_s(\mathbb{R}^d,\ell_p)$ when $p \in [1,2]$ with approximation $O(\log^2 s)$. By being data-dependent, we improve the approximation to $\tilde{O}(\log s)$. Our main technical contribution is to show that for any distribution $\mu$ supported on the metric $\mathsf{EMD}_s(\mathbb{R}^d, \ell_p)$, there exists a data-dependent LSH for dense regions of $\mu$ which achieves approximation $\tilde{O}(\log s)$, and that the data-independent LSH actually achieves an $\tilde{O}(\log s)$-approximation outside of those dense regions. Finally, we show how to "glue" together these two hashing schemes without any additional loss in the approximation. Beyond nearest neighbor search, our data-dependent LSH also gives optimal (distributional) sketches for the Earth Mover's Distance. By known sketching lower bounds, this implies that our LSH is optimal (up to $\mathrm{poly}(\log \log s)$ factors) among those that collide close points with constant probability.
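To make the definition of $\mathsf{EMD}(x,y)$ concrete, the following is a minimal Python sketch that computes the minimum-cost perfect matching by brute force. It only illustrates the metric itself and is not one of the hashing schemes described above. The function names `lp_distance` and `emd` are illustrative assumptions, and enumerating all $s!$ matchings is feasible only for very small $s$.

```python
from itertools import permutations


def lp_distance(u, v, p=1.0):
    """Return the l_p distance between two vectors of equal dimension."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def emd(x, y, p=1.0):
    """Return EMD(x, y): the minimum cost of a perfect matching between
    two sets of s vectors, where matching two vectors costs their l_p
    distance. Brute force over all s! matchings (illustration only)."""
    if len(x) != len(y):
        raise ValueError("both sets must contain the same number of vectors")
    s = len(x)
    # cost[i][j] is the cost of matching x[i] with y[j].
    cost = [[lp_distance(x[i], y[j], p) for j in range(s)] for i in range(s)]
    # A perfect matching is a permutation pi sending x[i] to y[pi[i]].
    return min(
        sum(cost[i][pi[i]] for i in range(s))
        for pi in permutations(range(s))
    )


if __name__ == "__main__":
    x = [(0.0, 0.0), (1.0, 0.0)]
    y = [(1.0, 1.0), (0.0, 1.0)]
    print(emd(x, y, p=1.0))
    print(emd(x, y, p=2.0))
```

For larger $s$, a polynomial-time assignment solver would replace the enumeration; the brute-force loop is kept here only because it mirrors the definition directly.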