FedDebug: Systematic Debugging for Federated Learning Applications

In Federated Learning (FL), clients independently train local models and share them with a central aggregator to build a global model. Because the aggregator cannot access clients' data and training is collaborative, FL appeals to applications with data-privacy concerns, such as medical imaging. However, these same characteristics pose unprecedented challenges for debugging. When a global model's performance deteriorates, identifying the responsible rounds and clients is a major pain point. Developers resort to trial-and-error debugging with subsets of clients, hoping to increase the global model's accuracy, or they let future FL rounds retune the model; both approaches are time-consuming and costly. We design a systematic fault localization framework, FedDebug, that advances FL debugging on two novel fronts. First, FedDebug enables interactive debugging of real-time collaborative training in FL by leveraging record and replay techniques to construct a simulation that mirrors live FL. FedDebug's breakpoint helps inspect an FL state (round, client, and global model) and move between rounds and clients' models seamlessly, enabling fine-grained, step-by-step inspection. Second, FedDebug automatically identifies the client(s) responsible for lowering the global model's performance without any testing data or labels, both of which are essential for existing debugging techniques. FedDebug's strengths come from adapting differential testing in conjunction with neuron activations to determine the client(s) deviating from normal behavior. FedDebug achieves 100% accuracy in finding a single faulty client and 90.3% accuracy in finding multiple faulty clients. FedDebug's interactive debugging incurs 1.2% overhead during training, while it localizes a faulty client in only 2.1% of a round's training time.
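To make the idea of differential testing over neuron activations concrete, the following is a minimal sketch and not FedDebug's actual implementation. It assumes each client model is a single ReLU layer represented by a weight matrix, that random inputs stand in for the missing test data, and that the client whose activation pattern disagrees most with its peers on an input receives a vote. The names `activation_pattern` and `localize_faulty_client` are hypothetical and chosen only for illustration.

```python
import numpy as np


def activation_pattern(weights, inputs, threshold=0.0):
    """Boolean matrix (inputs x neurons): which neurons fire for each input.

    Assumption: the "model" is one dense layer followed by ReLU.
    """
    hidden = np.maximum(inputs @ weights, 0.0)
    return hidden > threshold


def localize_faulty_client(client_weights, num_inputs=32, seed=0):
    """Vote for the client whose activations deviate most from its peers.

    No labels or real test data are used; inputs are random (an assumption
    of this sketch, not a statement about FedDebug's input generation).
    """
    rng = np.random.default_rng(seed)
    input_dim = client_weights[0].shape[0]
    inputs = rng.standard_normal((num_inputs, input_dim))
    patterns = [activation_pattern(w, inputs) for w in client_weights]

    n = len(patterns)
    votes = np.zeros(n, dtype=int)
    for i in range(num_inputs):
        # Total Hamming distance from each client's pattern to all others.
        dists = np.zeros(n)
        for a in range(n):
            for b in range(n):
                if a != b:
                    dists[a] += np.sum(patterns[a][i] != patterns[b][i])
        votes[int(np.argmax(dists))] += 1
    return int(np.argmax(votes)), votes


# Demo: five clients share similar weights; client 3 is deliberately corrupted.
rng = np.random.default_rng(42)
base = rng.standard_normal((8, 16))
clients = [base + 0.01 * rng.standard_normal(base.shape) for _ in range(5)]
clients[3] = -base  # hypothetical fault: sign-flipped weights
suspect, votes = localize_faulty_client(clients)
print("suspected faulty client:", suspect)
```

Voting per input rather than averaging all distances keeps one unusual input from dominating the decision. Real models would require comparing activations across multiple layers, which this sketch omits.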

