Consider two $D$-dimensional data vectors (e.g., embeddings) $u$ and $v$. In many embedding-based retrieval (EBR) applications, where the vectors are generated by trained models, $D=256\sim 1024$ is common. In this paper, OPORP (one permutation + one random projection) uses a variant of the ``count-sketch'' type of data structure to achieve data reduction/compression. With OPORP, we first apply a permutation to the data vectors. A random vector $r$ is generated with i.i.d. entries whose moments satisfy $E(r_i) = 0, E(r_i^2)=1, E(r_i^3) =0, E(r_i^4)=s$. We multiply $r$ entrywise with every permuted data vector, then break the $D$ columns into $k$ equal-length bins and aggregate (i.e., sum) the values in each bin, so that each bin holds the dot product of one segment of $r$ with the matching segment of the data vector. This yields $k$ samples from each data vector. One crucial step is to normalize the $k$ samples to unit $l_2$ norm. We show that the estimation variance is essentially $(s-1)A + \frac{D-k}{D-1}\frac{1}{k}\left[ (1-\rho^2)^2 -2A\right]$, where $\rho$ is the cosine similarity of $u$ and $v$ and $A\geq 0$ is a function of the data ($u,v$). This formula reveals several key properties: (1) We need $s=1$, so that the first term vanishes. (2) The factor $\frac{D-k}{D-1}$ can substantially reduce the variance, and it shrinks to $0$ as $k$ approaches $D$. (3) The term $\frac{1}{k}(1-\rho^2)^2$ is the asymptotic variance of the classical correlation estimator. We show that setting $k=1$ in OPORP and repeating the procedure $m$ times exactly recovers ``very sparse random projections'' (VSRP). This immediately leads to a normalized estimator for VSRP which substantially improves on the original VSRP estimator. In summary, the two key steps of OPORP, (i) the normalization and (ii) the fixed-length binning scheme, considerably improve the accuracy of estimating the cosine similarity, a routine (and crucial) task in modern embedding-based retrieval (EBR) applications.
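The procedure described above (permute, multiply by $r$, sum within $k$ equal-length bins, normalize) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `make_oporp`, `oporp_sketch`, `estimate_cosine`, and `cosine` are hypothetical, $r_i$ is drawn from $\{-1,+1\}$ with equal probability (one distribution with $s=1$), $k$ is assumed to divide $D$, and the estimate is taken to be the inner product of the two normalized sketches.

```python
import math
import random


def make_oporp(D, k, seed=0):
    """Draw one permutation and one sign vector, shared by all data vectors."""
    if D % k != 0:
        raise ValueError("this sketch assumes k divides D")
    rng = random.Random(seed)
    perm = list(range(D))
    rng.shuffle(perm)
    # Symmetric +/-1 entries: E(r)=0, E(r^2)=1, E(r^3)=0, E(r^4)=1, so s = 1.
    signs = [rng.choice((-1.0, 1.0)) for _ in range(D)]
    return perm, signs


def oporp_sketch(x, perm, signs, k):
    """Permute x, multiply entrywise by signs, sum within k bins, normalize."""
    D = len(x)
    width = D // k  # fixed-length bins
    bins = [0.0] * k
    for i in range(D):
        bins[i // width] += signs[i] * x[perm[i]]
    norm = math.sqrt(sum(b * b for b in bins))
    if norm == 0.0:
        return bins  # degenerate case: leave the all-zero sketch as is
    return [b / norm for b in bins]


def estimate_cosine(sx, sy):
    """Inner product of two unit-norm sketches."""
    return sum(a * b for a, b in zip(sx, sy))


def cosine(x, y):
    """Exact cosine similarity, for comparison only."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)


# Small demonstration on synthetic, correlated vectors (made-up data).
demo_rng = random.Random(123)
D, k = 256, 32
u = [demo_rng.gauss(0.0, 1.0) for _ in range(D)]
v = [0.8 * a + 0.6 * demo_rng.gauss(0.0, 1.0) for a in u]
perm, signs = make_oporp(D, k, seed=7)
su = oporp_sketch(u, perm, signs, k)
sv = oporp_sketch(v, perm, signs, k)
print("exact cosine:   ", cosine(u, v))
print("OPORP estimate: ", estimate_cosine(su, sv))
```

With $k=D$ every bin holds a single coordinate, the signs cancel in the inner product, and the estimate coincides with the exact cosine. This matches the factor $\frac{D-k}{D-1}$ being $0$ at $k=D$ when $s=1$.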