Sherpa.ai Privacy-Preserving Multi-Party Entity Alignment without Intersection Disclosure for Noisy Identifiers

Federated Learning (FL) enables collaborative model training among multiple parties without centralizing raw data. There are two main paradigms in FL: Horizontal FL (HFL), where all participants share the same feature space but hold different samples, and Vertical FL (VFL), where parties possess complementary features for the same set of samples. A prerequisite for VFL training is privacy-preserving entity alignment (PPEA), which establishes a common index of samples across parties (alignment) without revealing which samples are shared between them. Conventional private set intersection (PSI) achieves alignment but leaks intersection membership, exposing sensitive relationships between datasets. The standard private set union (PSU) mitigates this risk by aligning on the union of identifiers rather than the intersection. However, existing approaches are often limited to two parties or lack support for typo-tolerant matching. In this paper, we introduce the Sherpa.ai multi-party PSU protocol for VFL, a PPEA method that hides intersection membership and enables both exact and noisy matching. The protocol generalizes two-party approaches to multiple parties with low communication overhead and offers two variants: an order-preserving version for exact alignment and an unordered version tolerant to typographical and formatting discrepancies. We prove correctness and privacy, analyze communication and computational (exponentiation) complexity, and formalize a universal index mapping from local records to a shared index space. This multi-party PSU offers a scalable, mathematically grounded protocol for PPEA in real-world VFL deployments, such as multi-institutional healthcare disease detection, collaborative risk modeling between banks and insurers, and cross-domain fraud detection between telecommunications and financial institutions, while preserving intersection privacy.
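The idea of a universal index mapping over the union of identifiers can be illustrated with a small plaintext sketch. This is not the protocol described above: it uses no cryptography and therefore offers none of the privacy guarantees. It only shows what aligning on the union, rather than the intersection, produces. The function names (`normalize`, `universal_index`, `align`), the party names, and the sample identifiers are hypothetical. The normalization rule is a simple stand-in for formatting tolerance, not the paper's noisy-matching method.

```python
from typing import Dict, List, Optional


def normalize(identifier: str) -> str:
    """Hypothetical canonicalization: trim, lowercase, drop spaces and hyphens."""
    return "".join(ch for ch in identifier.strip().lower() if ch not in " -")


def universal_index(parties: Dict[str, List[str]]) -> Dict[str, int]:
    """Map every identifier in the union of all parties to one shared position."""
    union = sorted({normalize(i) for ids in parties.values() for i in ids})
    return {ident: pos for pos, ident in enumerate(union)}


def align(local_ids: List[str], index: Dict[str, int]) -> List[Optional[int]]:
    """For each shared position, give the local row number, or None if absent."""
    rows: List[Optional[int]] = [None] * len(index)
    for row, ident in enumerate(local_ids):
        rows[index[normalize(ident)]] = row
    return rows


# Illustrative data only (assumed, not taken from the paper).
parties = {
    "hospital": ["AB-123", "cd 456"],
    "insurer": ["ab123", "EF-789"],
}

index = universal_index(parties)
aligned = {name: align(ids, index) for name, ids in parties.items()}

for name, rows in aligned.items():
    print(name, rows)
```

Every party ends up with a table of the same length, one slot per element of the union, so the shape of the aligned data does not by itself indicate which records are shared. In this sketch the `None` slots are visible in the clear; how absent records are filled or hidden in an actual deployment is outside what this example shows.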
