Entity resolution (ER) is a critical data management task that determines whether multiple records refer to the same real-world entity. Despite its importance in domains such as healthcare, finance, and machine learning, implementing an effective ER system remains challenging: the abundance of methodologies and tools creates a paradox of choice for practitioners. This paper proposes Resolvi, a reference architecture aimed at improving extensibility, interoperability, and scalability in ER systems. By analyzing existing ER frameworks and the literature, we establish a structured approach to designing ER solutions that address common challenges. We also explore best practices for system implementation and deployment strategies that facilitate large-scale entity resolution. Through this work, we aim to provide a foundational blueprint that helps researchers and practitioners develop robust, scalable ER systems while reducing the complexity of architectural decisions.