Minwise hashing (MinHash) is a standard algorithm, widely used in industry, for large-scale search and learning applications based on the binary (0/1) Jaccard similarity. One common use of MinHash is to process massive n-gram text representations so that practitioners do not have to materialize the original data, which would be prohibitively expensive. Another popular use is to build hash tables that enable sub-linear time approximate near neighbor (ANN) search. MinHash has also been used as a tool for building large-scale machine learning systems. The standard implementation of MinHash requires applying $K$ random permutations. In comparison, one permutation hashing (OPH) is an efficient alternative to MinHash that applies a single permutation, splits the data vector into $K$ bins, and generates one hash value within each bin. OPH is substantially more efficient and also more convenient to use. In this paper, we combine differential privacy (DP) with OPH (as well as MinHash) to propose the DP-OPH framework with three variants: DP-OPH-fix, DP-OPH-re, and DP-OPH-rand, depending on which densification strategy is adopted to deal with empty bins in OPH. A detailed roadmap to the algorithm design is presented along with the privacy analysis. An analytical comparison of the proposed DP-OPH methods with DP minwise hashing (DP-MH) is provided to justify the advantage of DP-OPH. Experiments on similarity search confirm the merits of DP-OPH and guide the choice of the proper variant in different practical scenarios. Our technique is also extended to bin-wise consistent weighted sampling (BCWS) to develop a new DP algorithm, called DP-BCWS, for non-binary data. Experiments on classification tasks demonstrate that DP-BCWS is able to achieve excellent utility at around $\epsilon = 5\sim 10$, where $\epsilon$ is the standard privacy parameter in the language of $(\epsilon, \delta)$-DP.
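To make the binning idea concrete, the following is a minimal sketch of plain OPH (without densification and without any privacy mechanism), assuming a binary vector is given by its set of nonzero indices and that the dimension is divisible by $K$. The function names `oph_sketch` and `estimate_jaccard`, the `EMPTY` marker, and the use of a seeded shuffle as the permutation are illustrative assumptions, not the paper's implementation. Empty bins are left unfilled here; the three DP-OPH variants differ precisely in how such bins are handled.

```python
import random

EMPTY = None  # hypothetical marker for a bin that received no nonzero element


def oph_sketch(nonzeros, dim, num_bins, seed=0):
    """One permutation hashing of a binary vector given by its nonzero indices.

    Assumes dim is divisible by num_bins. Empty bins stay EMPTY
    (no densification is applied in this sketch).
    """
    if dim % num_bins != 0:
        raise ValueError("dim must be divisible by num_bins")
    rng = random.Random(seed)
    perm = list(range(dim))
    rng.shuffle(perm)  # a single permutation, shared by all vectors via the seed
    bin_size = dim // num_bins
    sketch = [EMPTY] * num_bins
    for i in nonzeros:
        # Permuted position -> (bin index, offset inside that bin).
        b, offset = divmod(perm[i], bin_size)
        # Keep the smallest offset seen in each bin.
        if sketch[b] is EMPTY or offset < sketch[b]:
            sketch[b] = offset
    return sketch


def estimate_jaccard(s1, s2):
    """Matching bins divided by bins that are non-empty in at least one sketch."""
    matches = 0
    usable = 0
    for a, b in zip(s1, s2):
        if a is EMPTY and b is EMPTY:
            continue  # jointly empty bins carry no information
        usable += 1
        if a == b:
            matches += 1
    return matches / usable if usable else 0.0


if __name__ == "__main__":
    u = {1, 5, 9, 20, 33, 40}
    v = {1, 5, 9, 21, 33, 47}
    su = oph_sketch(u, dim=64, num_bins=8, seed=7)
    sv = oph_sketch(v, dim=64, num_bins=8, seed=7)
    print(su, sv, estimate_jaccard(su, sv))
```

Both vectors must be hashed with the same seed so that they share the permutation. Only one pass over the nonzeros is needed, in contrast to the $K$ passes required by $K$ independent permutations in standard MinHash.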