Fast Similarity Sketching

Comments (from arXiv): The original version was based directly on a conference paper of the same title from FOCS'17. This new version is substantially revised, with some cleaner and stronger theorems, particularly concerning the high-probability domain. Moreover, there is one more author, Jakob Houen. In addition, one of the original authors, Mathias, has changed surname from Knudsen to Langhede.

We consider the $\textit{Similarity Sketching}$ problem: Given a universe $[u] = \{0,\ldots, u-1\}$, we want a random function $S$ mapping subsets $A\subseteq [u]$ into vectors $S(A)$ of size $t$, such that the Jaccard similarity $J(A,B) = |A\cap B|/|A\cup B|$ between sets $A$ and $B$ is preserved. More precisely, define $X_i = [S(A)[i] = S(B)[i]]$ and $X = \sum_{i\in [t]} X_i$. We want $E[X_i]=J(A,B)$, and we want $X$ to be strongly concentrated around $E[X] = t \cdot J(A,B)$ (i.e., Chernoff-style bounds). This is a fundamental problem which, via the classic MinHash algorithm, has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. The vectors $S(A)$ are also called $\textit{sketches}$. Strong concentration is critical, for often we want to sketch many sets $B_1,\ldots,B_n$ so that later, for a query set $A$, we can find (one of) the most similar $B_i$. It is then critical that no $B_i$ looks much more similar to $A$ due to errors in the sketch. The seminal $t\times\textit{MinHash}$ algorithm uses $t$ random hash functions $h_1,\ldots, h_t$, and stores $\left( \min_{a\in A} h_1(a),\ldots, \min_{a\in A} h_t(a) \right)$ as the sketch of $A$. The main drawback of MinHash is, however, its $O(t\cdot |A|)$ running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. (continued...)
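To make the $t\times\textit{MinHash}$ baseline concrete, here is a minimal Python sketch of the sketch vector and of the estimator $X/t$ for $J(A,B)$. The function names, the seed parameter, and the linear hash family $h_i(x) = (a_i x + b_i) \bmod p$ are assumptions made for illustration and do not come from the paper. That family is only 2-independent, so it stands in for the "random hash functions" of the abstract without guaranteeing $E[X_i] = J(A,B)$ exactly.

```python
import random

# Mersenne prime used as the hash range (an illustrative choice, not from the paper).
P = (1 << 61) - 1


def make_hashes(t, seed=0):
    """Draw t hypothetical hash functions h_i(x) = (a_i * x + b_i) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]


def minhash_sketch(A, hashes):
    """Return (min_{x in A} h_1(x), ..., min_{x in A} h_t(x)).

    Every hash function is evaluated on every element, so this costs O(t * |A|).
    """
    if not A:
        raise ValueError("the sketch of an empty set is not defined in this example")
    return [min((a * x + b) % P for x in A) for (a, b) in hashes]


def estimate_jaccard(sketch_a, sketch_b):
    """Return X / t, where X counts the coordinates on which the sketches agree."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length t")
    X = sum(1 for p, q in zip(sketch_a, sketch_b) if p == q)
    return X / len(sketch_a)


def jaccard(A, B):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| for comparison."""
    return len(A & B) / len(A | B)


if __name__ == "__main__":
    hashes = make_hashes(t=256, seed=1)
    A = {3, 17, 42, 99, 1024, 4096}
    B = {3, 17, 42, 77, 2048, 4096}
    print("exact    :", jaccard(A, B))
    print("estimate :", estimate_jaccard(minhash_sketch(A, hashes),
                                         minhash_sketch(B, hashes)))
```

The nested loop in `minhash_sketch` is where the $O(t\cdot |A|)$ cost named in the abstract comes from: each of the $t$ coordinates scans all of $A$.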
