Locality-sensitive hashing (LSH) is a fundamental algorithmic technique widely used in large-scale data processing applications such as nearest-neighbor search, entity resolution, and clustering. However, its applicability in some real-world scenarios is limited because the hashing functions must be carefully designed to match a specific metric. Existing LSH-based entity blocking solutions rely primarily on generic similarity metrics such as Jaccard similarity, whereas practical use cases often demand complex, customized similarity rules that generic metrics cannot express. Designing LSH functions for these customized similarity rules is therefore a considerable challenge. In this research, we propose a neuralization approach that enhances locality-sensitive hashing by training deep neural networks to serve as hashing functions for complex metrics. We assess the effectiveness of this approach on the entity resolution problem, which frequently involves task-specific metrics in real-world applications. Specifically, we introduce NLSHBlock (Neural-LSH Block), a novel blocking methodology that leverages pre-trained language models fine-tuned with a novel LSH-based loss function. Through extensive evaluations on a diverse range of real-world datasets, we show that NLSHBlock significantly outperforms existing methods. Furthermore, we show that NLSHBlock improves the performance of the entity matching phase, particularly in the semi-supervised setting.
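For context, the kind of generic, Jaccard-based LSH blocking that the abstract contrasts against can be sketched as follows. This is a minimal illustration of classical MinHash with banding, not NLSHBlock or the authors' code; the function names (`shingles`, `minhash_signature`, `lsh_blocks`), the 3-gram tokenization, and the parameter values are all assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(items, num_hashes=32):
    """For each seeded hash function, keep the minimum hash value over the set.

    Two sets agree on one signature position with probability equal to
    their Jaccard similarity.
    """
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def lsh_blocks(records, num_hashes=32, bands=8):
    """Return candidate pairs of record ids that share at least one band.

    `records` maps a record id to its text. The signature is split into
    `bands` equal slices; records whose slices match land in the same bucket.
    """
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rec_id, text in records.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(rec_id)
    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    demo = {
        "a": "Acme Corp, New York",
        "b": "acme corp, new york",
        "c": "zzzzzzzz",
    }
    print(lsh_blocks(demo))
```

The hash family here is tied to Jaccard similarity over token sets. A customized similarity rule has no such ready-made hash family, which is the gap the abstract proposes to close by training a neural network to act as the hashing function.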