In this paper, we introduce strategies for developing private Key Information Extraction (KIE) systems by leveraging large pretrained document foundation models in conjunction with differential privacy (DP), federated learning (FL), and differentially private federated learning (DP-FL). Through extensive experiments on six benchmark datasets (FUNSD, CORD, SROIE, WildReceipts, XFUND, and DOCILE), we demonstrate that large document foundation models can be effectively fine-tuned for the KIE task under private settings, achieving adequate performance while maintaining strong privacy guarantees. Moreover, by thoroughly analyzing the impact of various training and model parameters on model performance, we propose simple yet effective guidelines for achieving an optimal privacy-utility trade-off for the KIE task under global DP. Finally, we introduce FeAm-DP, a novel DP-FL algorithm that efficiently scales global DP from a standalone context to a multi-client federated environment. We conduct a comprehensive evaluation of the algorithm across various client and privacy settings, and demonstrate that it achieves performance and privacy guarantees comparable to standalone DP, even as the number of participating clients increases. Overall, our study offers valuable insights into the development of private KIE systems and highlights the potential of document foundation models for privacy-preserving Document AI applications. To the best of the authors' knowledge, this is the first work to explore privacy-preserving document KIE using document foundation models.
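To make the combination of global DP and federated averaging concrete, the following is a minimal sketch of a generic differentially private aggregation step: each client update is clipped to a fixed L2 norm, the clipped updates are summed, Gaussian noise is added, and the result is averaged. This is not FeAm-DP, whose details are not given here; the function names, `clip_norm`, and `noise_multiplier` are illustrative assumptions, and privacy accounting (tracking the resulting privacy budget) is deliberately left out.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale an update so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def dp_federated_average(client_updates, clip_norm, noise_multiplier, rng):
    """Generic DP aggregation sketch (hypothetical helper, not FeAm-DP).

    Clips every client update, sums them, adds Gaussian noise with standard
    deviation noise_multiplier * clip_norm to each coordinate of the sum,
    and divides by the number of clients.
    """
    n = len(client_updates)
    clipped = [clip_update(u, clip_norm) for u in client_updates]
    dim = len(clipped[0])
    summed = [sum(u[i] for u in clipped) for i in range(dim)]
    sigma = noise_multiplier * clip_norm
    return [(s + rng.gauss(0.0, sigma)) / n for s in summed]


if __name__ == "__main__":
    rng = random.Random(0)
    # Three toy client updates for a two-parameter model (made-up values).
    updates = [[3.0, 4.0], [0.3, 0.4], [-1.0, 2.0]]
    print(dp_federated_average(updates, clip_norm=1.0, noise_multiplier=0.5, rng=rng))
```

Clipping bounds how much any single client can shift the aggregate, which is what lets the noise scale be tied to `clip_norm`; a larger `noise_multiplier` gives stronger privacy at the cost of a noisier average.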