A set of vectors $S \subseteq \mathbb{R}^d$ is $(k_1,\varepsilon)$-clusterable if there are $k_1$ balls of radius $\varepsilon$ that cover $S$. The set $S$ is $(k_2,\delta)$-far from being clusterable if it contains at least $k_2$ vectors whose pairwise distances are all at least $\delta$. We propose a probabilistic algorithm to distinguish between these two cases. Our algorithm reaches a decision by looking only at the extreme values on $S$ of a scalar-valued hash function defined by a random field; hence, it is especially suitable for distributed and online settings. An important feature of our method is that the algorithm is oblivious to the number of vectors: in the online setting, for example, it stores only a constant number of scalars, independent of the stream length. We introduce random field hash functions, which are a key ingredient in our paradigm. Random field hash functions generalize locality-sensitive hashing (LSH). In addition to the LSH requirement that ``nearby vectors are hashed to similar values'', our hash function also guarantees that the ``hash values are (nearly) independent random variables for distant vectors''. We formulate necessary conditions on the kernels that define the random fields applied to our problem, as well as a measure of kernel optimality, for which we provide a bound. Finally, we propose a method to construct kernels that approximate the optimal one.
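To make the online setting concrete, the following is a minimal sketch of a stream processor that keeps only the running maxima of a few scalar hash functions. It is not the construction from this work: the hash is assumed here to be an approximately Gaussian random field built from random Fourier features of a Gaussian kernel, and the names `RandomFieldHash`, `ExtremeValueSketch`, `scale`, `n_features`, and `n_hashes` are hypothetical. The decision rule applied to the stored extremes is not described above and is deliberately left out.

```python
import numpy as np


class RandomFieldHash:
    """Hypothetical scalar hash h(x) sampled from an approximate Gaussian field.

    Assumption: random Fourier features approximating the kernel
    exp(-||x - y||^2 / (2 * scale^2)). Nearby vectors get similar values;
    vectors much farther apart than `scale` get nearly uncorrelated values.
    """

    def __init__(self, dim, scale, n_features=256, seed=None):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(0.0, 1.0 / scale, size=(n_features, dim))
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        self.a = rng.normal(size=n_features)  # random field coefficients
        self.norm = np.sqrt(2.0 / n_features)

    def __call__(self, x):
        x = np.atleast_2d(x)  # accept one vector or a batch of vectors
        return self.norm * (np.cos(x @ self.w.T + self.b) @ self.a)


class ExtremeValueSketch:
    """Keeps one running maximum per hash; the state does not grow with the stream."""

    def __init__(self, dim, scale, n_hashes=8, seed=0):
        self.hashes = [
            RandomFieldHash(dim, scale, seed=seed + i) for i in range(n_hashes)
        ]
        self.maxima = np.full(n_hashes, -np.inf)

    def update(self, x):
        values = np.array([h(x)[0] for h in self.hashes])
        np.maximum(self.maxima, values, out=self.maxima)


# Usage: feed a stream of 1000 vectors in R^3, one at a time.
rng = np.random.default_rng(1)
stream = rng.normal(size=(1000, 3))
sketch = ExtremeValueSketch(dim=3, scale=0.5)
for x in stream:
    sketch.update(x)
```

Because a maximum is associative and commutative, sketches computed on separate shards of the data can be merged with an elementwise maximum, which is one way to read the suitability for distributed settings mentioned above.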