OPORP: One Permutation + One Random Projection

Consider two $D$-dimensional data vectors (e.g., embeddings): $u, v$. In many embedding-based retrieval (EBR) applications where the vectors are generated from trained models, dimensions of $D=256\sim 1024$ are common. In this paper, OPORP (one permutation + one random projection) uses a variant of the ``count-sketch'' type of data structures for data reduction/compression. With OPORP, we first apply a permutation to the data vectors. A random vector $r$ is generated with i.i.d. entries whose moments are $E(r_i) = 0, E(r_i^2)=1, E(r_i^3) =0, E(r_i^4)=s$. We multiply $r$ element-wise with each permuted data vector. Then we break the $D$ columns into $k$ equal-length bins and aggregate (i.e., sum) the values in each bin to obtain $k$ samples from each data vector, so that each sample is the dot product of $r$ and the permuted data vector restricted to one bin. One crucial step is to normalize the $k$ samples to unit $l_2$ norm. We show that the estimation variance is essentially: $(s-1)A + \frac{D-k}{D-1}\frac{1}{k}\left[ (1-\rho^2)^2 -2A\right]$, where $\rho$ is the cosine similarity of $u$ and $v$, and $A\geq 0$ is a function of the data ($u,v$). This formula reveals several key properties: (1) We need $s=1$, the smallest possible value given $E(r_i^2)=1$, which removes the term $(s-1)A$ and is attained when $r_i \in \{-1,+1\}$ with equal probabilities. (2) The factor $\frac{D-k}{D-1}$ can be highly beneficial in reducing variances. (3) The term $\frac{1}{k}(1-\rho^2)^2$ is the asymptotic variance of the classical correlation estimator. We illustrate that by setting $k=1$ in OPORP and repeating the procedure $m$ times, we exactly recover the work of ``very sparse random projections'' (VSRP). This immediately leads to a normalized estimator for VSRP which substantially improves the original estimator of VSRP. In summary, with OPORP, the two key steps, (i) the normalization and (ii) the fixed-length binning scheme, considerably improve the accuracy in estimating the cosine similarity, which is a routine (and crucial) task in modern embedding-based retrieval (EBR) applications.
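
The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `oporp_sketch` and `estimate_cosine` are hypothetical, $r$ is drawn from $\{-1,+1\}$ with equal probabilities (so that $s=1$), $D$ is assumed to be divisible by $k$, and the same permutation and the same $r$ are shared by both vectors.

```python
import numpy as np


def oporp_sketch(x, perm, r, k):
    """Return k normalized OPORP samples of x (hypothetical helper).

    x    : data vector of length D
    perm : permutation of range(D), shared by all data vectors
    r    : random vector of length D, shared by all data vectors
    k    : number of equal-length bins (assumed to divide D)
    """
    D = x.shape[0]
    if D % k != 0:
        raise ValueError("this sketch assumes D is divisible by k")
    # Permute, multiply element-wise by r, then sum within each of the k bins.
    z = (x[perm] * r).reshape(k, D // k).sum(axis=1)
    # Normalize the k samples to unit l2 norm (left unchanged if all zero).
    norm = np.linalg.norm(z)
    return z / norm if norm > 0 else z


def estimate_cosine(u, v, k, seed=0):
    """Estimate the cosine similarity of u and v from their OPORP samples."""
    rng = np.random.default_rng(seed)
    D = u.shape[0]
    perm = rng.permutation(D)             # one permutation
    r = rng.choice([-1.0, 1.0], size=D)   # assumption: +/-1 entries, so s = 1
    return float(oporp_sketch(u, perm, r, k) @ oporp_sketch(v, perm, r, k))


if __name__ == "__main__":
    rng = np.random.default_rng(123)
    u = rng.standard_normal(1024)
    v = 0.8 * u + 0.6 * rng.standard_normal(1024)
    exact = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    print("exact cosine:", exact)
    print("OPORP estimate with k=256:", estimate_cosine(u, v, k=256))
```

With $k=D$ every bin holds a single coordinate, so the normalized inner product equals the exact cosine similarity, consistent with the factor $\frac{D-k}{D-1}$ vanishing at $k=D$. For $D=1024$ and $k=256$ that factor is $768/1023 \approx 0.75$.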
