Federated learning is a growing field in the machine learning community due to its decentralized and private design. Model training in federated learning is distributed over multiple clients, giving access to large amounts of client data while maintaining privacy. A server then aggregates the training done on these clients without accessing their data. Such data could include emojis, which are widely used on social media services and instant messaging platforms to express users' sentiments. This paper proposes federated learning-based multilingual emoji prediction in both clean and attack scenarios. Emoji prediction data were collected from Twitter and from the SemEval emoji dataset. These data are used to train and evaluate transformer models of different sizes, including a sparsely activated transformer, under one of two assumptions: clean data on all clients, or data poisoned by a label flipping attack on some clients. Experimental results show that federated learning, in both clean and attacked scenarios, performs similarly to centralized training in multilingual emoji prediction on seen and unseen languages under different data sources and distributions. Our trained transformers outperform other techniques on the SemEval emoji dataset, in addition to offering the privacy and distributed benefits of federated learning.
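The two mechanisms named in the abstract, server-side aggregation and label flipping, can be illustrated with a minimal sketch. It assumes FedAvg-style aggregation (a mean of client weights, weighted by each client's example count) and a random-reassignment form of label flipping; the abstract confirms neither choice, and the names `flip_labels` and `federated_average` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Weights = List[float]


def flip_labels(labels: Sequence[int], num_classes: int, rng: random.Random) -> List[int]:
    """Poison a client's labels by moving each one to a different class.

    Assumption: flipping is random reassignment; the paper may use another scheme.
    """
    flipped = []
    for y in labels:
        # An offset in [1, num_classes - 1] guarantees the new label differs.
        offset = rng.randrange(1, num_classes)
        flipped.append((y + offset) % num_classes)
    return flipped


def federated_average(client_updates: Sequence[Tuple[Weights, int]]) -> Weights:
    """Aggregate client weights on the server without seeing any client data.

    Assumption: FedAvg-style mean weighted by each client's example count.
    """
    total = sum(n for _, n in client_updates)
    if total <= 0:
        raise ValueError("clients must report a positive number of examples")
    dim = len(client_updates[0][0])
    averaged = [0.0] * dim
    for weights, n in client_updates:
        for i, w in enumerate(weights):
            averaged[i] += w * n / total
    return averaged


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical emoji class ids on one poisoned client.
    print(flip_labels([0, 3, 7, 19], num_classes=20, rng=rng))
    # Two clients with toy 2-parameter "models" and their example counts.
    print(federated_average([([1.0, 2.0], 1), ([3.0, 4.0], 3)]))
```

The server only receives weight vectors and example counts, which is why a poisoned client can influence the global model without its flipped labels ever being inspected.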