We introduce average-distortion sketching for metric spaces. As in (worst-case) sketching, these algorithms compress points in a metric space while approximately recovering pairwise distances. The novelty is the study of average distortion: for any fixed (yet arbitrary) distribution $\mu$ over the metric space, the sketch should not over-estimate distances, and it should (approximately) preserve the average distance with respect to draws from $\mu$. The notion generalizes average-distortion embeddings into $\ell_1$ [Rabinovich '03, Kush-Nikolov-Tang '21] as well as data-dependent locality-sensitive hashing [Andoni-Razenshteyn '15, Andoni-Naor-Nikolov-et-al. '18], both of which have recently been studied in the context of nearest neighbor search.

$\bullet$ For all $p \in [1, \infty)$ and any $c$ larger than a fixed constant, we give an average-distortion sketch for $([\Delta]^d, \ell_p)$ with approximation $c$ and bit-complexity $\text{poly}(cp \cdot 2^{p/c} \cdot \log(d\Delta))$, a bound that is provably impossible for (worst-case) sketching.

$\bullet$ As an application, we improve on the approximation of sublinear-time data structures for nearest neighbor search over $\ell_p$ (for large $p > 2$). The prior best approximation was $O(p)$ [Andoni-Naor-Nikolov-et-al. '18, Kush-Nikolov-Tang '21], and we show it can be any $c$ larger than a fixed constant (irrespective of $p$) by using $n^{\text{poly}(cp \cdot 2^{p/c})}$ space. We give some evidence that $2^{\Omega(p/c)}$ space may be necessary: we prove a lower bound on average-distortion sketches that produce a certain probabilistic certificate of farness (on which our sketches crucially rely).
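
The two requirements on an average-distortion sketch described above can be written out as a rough formalization. The notation here is an assumption made for illustration and is not taken from the abstract: $(X, d)$ is the metric space, $\mathrm{sk} \colon X \to \{0,1\}^s$ is the (possibly randomized, possibly $\mu$-dependent) sketch with bit-complexity $s$, and $\mathrm{est}$ is the procedure that estimates a distance from two sketches. The precise definition may differ, for example in how randomness and failure probability are handled.

$$
\mathrm{est}\big(\mathrm{sk}(x), \mathrm{sk}(y)\big) \le d(x, y) \quad \text{for all } x, y \in X,
\qquad
\mathop{\mathbb{E}}_{x, y \sim \mu}\Big[\mathrm{est}\big(\mathrm{sk}(x), \mathrm{sk}(y)\big)\Big] \ge \frac{1}{c} \mathop{\mathbb{E}}_{x, y \sim \mu}\big[d(x, y)\big].
$$

Under this reading, the first condition is the "no over-estimation" guarantee and holds pointwise, while the second is the only place where the distribution $\mu$ and the approximation factor $c$ enter.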