Privacy-Preserving Collaborative Chinese Text Recognition with Federated Learning

In Chinese text recognition, an organization with insufficient local data often needs to collect large amounts of data from similar organizations to improve its few-shot character recognition. However, text data naturally contains private information such as addresses and phone numbers, so organizations are unwilling to share it. Designing a privacy-preserving collaborative training framework for Chinese text recognition is therefore increasingly important. In this paper, we introduce personalized federated learning (pFL) into the Chinese text recognition task and propose the pFedCR algorithm, which significantly improves the model performance of each client (organization) without sharing private data. Specifically, building on CRNN, we add several attention layers to the model and design a two-stage training approach for the client to handle the non-IID problem of client data. In addition, we fine-tune the output layer of the model on the server using a virtual dataset, which mitigates character imbalance in Chinese documents. We validate the proposed approach on public benchmarks and on two self-built datasets from real-world industrial scenarios. The experimental results show that pFedCR improves the performance of local personalized models while also improving their generalization to other clients' data domains. Compared with local training within an organization, pFedCR improves model performance by about 20%. Compared with other state-of-the-art personalized federated learning methods, pFedCR improves performance by 6% to 8%. Moreover, through federated learning, pFedCR can correct erroneous annotations in the ground truth.
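
To make the idea of personalized federated learning concrete, the following is a minimal sketch of one aggregation round in which some parameters are averaged on the server and others stay on each client. It is not the pFedCR algorithm itself: the abstract does not specify which parameters are exchanged, so the choice to keep the attention layers local, the `attn.` name prefix, and the helper names `aggregate_shared` and `apply_global` are all assumptions made purely for illustration. The averaging step is plain sample-weighted FedAvg.

```python
from typing import Dict, List

Params = Dict[str, List[float]]

# Hypothetical convention (an assumption, not from the paper): parameters whose
# names start with this prefix are personalized and never leave the client.
PERSONAL_PREFIX = "attn."


def aggregate_shared(client_params: List[Params], num_samples: List[int]) -> Params:
    """Sample-weighted (FedAvg-style) average over the shared parameters only."""
    total = sum(num_samples)
    shared_names = [n for n in client_params[0] if not n.startswith(PERSONAL_PREFIX)]
    averaged: Params = {}
    for name in shared_names:
        size = len(client_params[0][name])
        averaged[name] = [
            sum(p[name][i] * n for p, n in zip(client_params, num_samples)) / total
            for i in range(size)
        ]
    return averaged


def apply_global(local: Params, global_shared: Params) -> Params:
    """Overwrite shared parameters with the server's; keep personalized ones."""
    merged = {name: list(values) for name, values in local.items()}
    for name, values in global_shared.items():
        merged[name] = list(values)
    return merged


# Two toy clients with 1 and 3 training samples respectively.
client_a = {"cnn.w": [1.0, 2.0], "attn.w": [0.5]}
client_b = {"cnn.w": [3.0, 6.0], "attn.w": [-0.5]}

global_shared = aggregate_shared([client_a, client_b], [1, 3])
updated_a = apply_global(client_a, global_shared)
```

In this toy run only `cnn.w` is sent to the server and averaged, while each client's `attn.w` keeps its local value. No raw training data is exchanged, which is the property the abstract relies on for privacy.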
