We introduce average-distortion sketching for metric spaces. As in (worst-case) sketching, these algorithms compress points in a metric space while approximately recovering pairwise distances. The novelty is the study of average distortion: for any fixed (yet arbitrary) distribution $\mu$ over the metric space, the sketch should not over-estimate distances, and it should (approximately) preserve the average distance with respect to draws from $\mu$. The notion generalizes average-distortion embeddings into $\ell_1$ [Rabinovich '03, Kush-Nikolov-Tang '21] as well as data-dependent locality-sensitive hashing [Andoni-Razenshteyn '15, Andoni-Naor-Nikolov-et-al. '18], both of which have recently been studied in the context of nearest neighbor search.
$\bullet$ For all $p \in [1, \infty)$ and any $c$ larger than a fixed constant, we give an average-distortion sketch for $([\Delta]^d, \ell_p)$ with approximation $c$ and bit-complexity $\text{poly}(cp \cdot 2^{p/c} \cdot \log(d\Delta))$, a guarantee that is provably impossible in (worst-case) sketching.
$\bullet$ As an application, we improve the approximation of sublinear-time data structures for nearest neighbor search over $\ell_p$ (for large $p > 2$). The prior best approximation was $O(p)$ [Andoni-Naor-Nikolov-et-al. '18, Kush-Nikolov-Tang '21], and we show that it can be any $c$ larger than a fixed constant (irrespective of $p$) by using $n^{\text{poly}(cp \cdot 2^{p/c})}$ space. We give some evidence that $2^{\Omega(p/c)}$ space may be necessary by proving a lower bound on average-distortion sketches that produce a certain probabilistic certificate of farness (on which our sketches crucially rely).
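One plausible way to formalize the two requirements on an average-distortion sketch is sketched below. The notation is an assumption made here for illustration and is not taken from the paper: $(X, d)$ is the metric space, $\mathrm{sk} \colon X \to \{0,1\}^s$ is a hypothetical sketching map with bit-complexity $s$, and $\mathrm{Dec}$ is a hypothetical decoder that estimates a distance from two sketches. The precise definition (for instance, how the randomness of the sketch or a distance threshold enters) may differ.

$$
\mathrm{Dec}\big(\mathrm{sk}(x), \mathrm{sk}(y)\big) \;\le\; d(x, y) \quad \text{for all } x, y \in X,
\qquad
\mathop{\mathbb{E}}_{x, y \sim \mu}\Big[\mathrm{Dec}\big(\mathrm{sk}(x), \mathrm{sk}(y)\big)\Big] \;\ge\; \frac{1}{c} \mathop{\mathbb{E}}_{x, y \sim \mu}\big[d(x, y)\big].
$$

Under this reading, the first condition is the "no over-estimation" requirement, and the second says the average distance under $\mu$ is preserved up to the approximation factor $c$, even though individual pairs may be contracted by much more.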