This paper studies rule-based blocking in entity resolution (ER). We propose HyperBlocker, a GPU-accelerated system for blocking in ER. Unlike previous blocking algorithms and parallel blocking solvers, HyperBlocker employs a pipelined architecture that overlaps data transfer with GPU operations. It generates a data-aware and rule-aware execution plan on CPUs that specifies how rules are evaluated, and it applies a number of hardware-aware optimizations to achieve massive parallelism on GPUs. Using real-life datasets, we show that HyperBlocker is at least 6.8x and 9.1x faster than prior CPU-powered distributed systems and GPU-based ER solvers, respectively. Moreover, combining HyperBlocker with the state-of-the-art ER matcher speeds up the overall ER process by at least 30% with comparable accuracy.