Beyond Worst-Case Dimensionality Reduction for Sparse Vectors

We study beyond worst-case dimensionality reduction for $s$-sparse vectors. Our work is divided into two parts, each focusing on a different facet of beyond worst-case analysis. We first consider average-case guarantees. A folklore upper bound based on the birthday paradox states that for any collection $X$ of $s$-sparse vectors in $\mathbb{R}^d$, there exists a linear map to $\mathbb{R}^{O(s^2)}$ which \emph{exactly} preserves the norm of $99\%$ of the vectors in $X$ in any $\ell_p$ norm (as opposed to the usual setting where guarantees hold for all vectors). We give lower bounds showing that this is indeed optimal in many settings: any oblivious linear map satisfying similar average-case guarantees must map to $\Omega(s^2)$ dimensions. The same lower bound also holds for a wide class of smooth maps, including `encoder-decoder schemes', where we compare the norm of the original vector to that of a smooth function of the embedding. These lower bounds reveal a separation result, as an upper bound of $O(s \log(d))$ is possible if we instead use arbitrary (possibly non-smooth) functions, e.g., via compressed sensing algorithms. Given these lower bounds, we specialize to sparse \emph{non-negative} vectors. For a dataset $X$ of non-negative $s$-sparse vectors and any $p \ge 1$, we can non-linearly embed $X$ into $O(s\log(|X|s)/\epsilon^2)$ dimensions while preserving all pairwise distances in $\ell_p$ norm up to $1\pm \epsilon$, with no dependence on $p$. Surprisingly, the non-negativity assumption enables much smaller embeddings than arbitrary sparse vectors, where the best known bounds suffer exponential dependence. Our map also guarantees \emph{exact} dimensionality reduction for $\ell_{\infty}$ by embedding into $O(s\log |X|)$ dimensions, which is tight. We show that both the non-linearity of the embedding $f$ and the non-negativity of $X$ are necessary, and provide downstream algorithmic improvements.
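
To make the folklore birthday-paradox bound concrete, here is a minimal Python sketch of one standard way such a map can be realized. It is an illustration under stated assumptions, not necessarily the construction analyzed in the paper. Each of the $d$ coordinates is assigned to one of $m = 50 s^2$ buckets uniformly at random, and the map sums the coordinates in each bucket, which is linear and oblivious to $X$. If the $s$ nonzero coordinates of a vector land in distinct buckets, the embedding is just a rearrangement of its nonzero entries, so every $\ell_p$ norm is preserved exactly. By a union bound over the $\binom{s}{2}$ pairs, a collision occurs with probability at most $s^2/(2m) = 1\%$. All function names, the constant 50, and the test parameters are hypothetical choices made for this example.

```python
import random


def random_bucket_map(d, m, rng):
    """Assign each of the d input coordinates to one of m buckets uniformly at random.

    The map is chosen without looking at the data, i.e. it is oblivious.
    """
    return [rng.randrange(m) for _ in range(d)]


def embed(x, h, m):
    """Apply the linear map y[j] = sum of x[i] over all i with h[i] == j.

    x is a sparse vector stored as a dict {coordinate index: value}.
    """
    y = [0] * m
    for i, v in x.items():
        y[h[i]] += v
    return y


def lp_norm(values, p):
    """l_p norm of an iterable of numbers; p may be float('inf')."""
    mags = [abs(v) for v in values]
    if p == float("inf"):
        return max(mags, default=0)
    return sum(a ** p for a in mags) ** (1.0 / p)


def is_collision_free(x, h):
    """True if the nonzero coordinates of x all fall into distinct buckets."""
    buckets = [h[i] for i in x]
    return len(set(buckets)) == len(buckets)


def random_sparse_vector(d, s, rng):
    """A random s-sparse vector with nonzero integer entries in [-9, 9] (test data only)."""
    support = rng.sample(range(d), s)
    return {i: rng.choice([-1, 1]) * rng.randint(1, 9) for i in support}


def demo(d=10_000, s=5, n=2_000, seed=0):
    """Fraction of n random s-sparse vectors that one fixed bucket map embeds without collisions."""
    rng = random.Random(seed)
    m = 50 * s * s  # O(s^2) target dimension; the constant 50 is an illustrative choice
    h = random_bucket_map(d, m, rng)
    vectors = [random_sparse_vector(d, s, rng) for _ in range(n)]
    return sum(is_collision_free(x, h) for x in vectors) / n


if __name__ == "__main__":
    print("fraction embedded without collisions:", demo())
```

Vectors whose support does suffer a collision may have their norm distorted by cancellation or merging, which is why the guarantee is average-case rather than for all vectors. Integer entries are used in the test data so that the norm comparison is exact rather than subject to floating-point rounding.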
