In Federated Learning (FL), clients independently train local models and share them with a central aggregator to build a global model. Because clients' data is never accessed directly and training is collaborative, FL is appealing for applications with data-privacy concerns, such as medical imaging. However, these same characteristics pose unprecedented challenges for debugging. When a global model's performance deteriorates, identifying the responsible rounds and clients is a major pain point. Developers resort to trial-and-error debugging with subsets of clients, hoping to increase the global model's accuracy, or they let future FL rounds retune the model; both approaches are time-consuming and costly. We design a systematic fault localization framework, FedDebug, that advances FL debugging on two novel fronts. First, FedDebug enables interactive debugging of real-time collaborative training in FL by leveraging record and replay techniques to construct a simulation that mirrors live FL. FedDebug's breakpoints help inspect an FL state (round, client, and global model) and move seamlessly between rounds and clients' models, enabling fine-grained, step-by-step inspection. Second, FedDebug automatically identifies the client(s) responsible for lowering the global model's performance without any testing data or labels, both of which are essential for existing debugging techniques. FedDebug's strength comes from adapting differential testing in conjunction with neuron activations to determine the client(s) deviating from normal behavior. FedDebug achieves 100% accuracy in finding a single faulty client and 90.3% accuracy in finding multiple faulty clients. FedDebug's interactive debugging incurs 1.2% overhead during training, and it localizes a faulty client in only 2.1% of a round's training time.
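To make the second idea concrete, the following is a minimal sketch of differential testing over neuron activations. It is an illustration under stated assumptions and not FedDebug's actual implementation: the tiny two-layer ReLU models, the helper names (`activation_pattern`, `suspect_client`, `make_clients`), the majority-vote rule, and the sign-flipped-weights fault are all hypothetical. The sketch feeds random inputs (so no test data or labels are needed) to every client model, records which neurons fire, and votes for the client whose firing pattern is farthest from the per-neuron majority.

```python
import numpy as np


def activation_pattern(model, x):
    """Return a boolean vector marking which neurons fire (ReLU output > 0) on x."""
    h, fired = x, []
    for W, b in model:
        h = np.maximum(h @ W + b, 0.0)  # ReLU layer
        fired.append(h > 0.0)
    return np.concatenate(fired)


def suspect_client(client_models, input_dim, n_inputs=50, seed=0):
    """Vote for the client whose activation pattern deviates most from the majority."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(client_models), dtype=int)
    for _ in range(n_inputs):
        x = rng.normal(size=input_dim)  # random input: no test data or labels
        patterns = np.stack([activation_pattern(m, x) for m in client_models])
        majority = patterns.mean(axis=0) > 0.5         # per-neuron majority behavior
        distance = (patterns != majority).sum(axis=1)  # Hamming distance per client
        votes[np.argmax(distance)] += 1                # this input's most deviant client
    return int(np.argmax(votes)), votes


def make_clients(n_benign=4, input_dim=8, hidden=16, out=4, seed=1):
    """Build hypothetical clients: benign ones are near-copies of a shared base model."""
    rng = np.random.default_rng(seed)
    base = [(rng.normal(size=(input_dim, hidden)), np.zeros(hidden)),
            (rng.normal(size=(hidden, out)), np.zeros(out))]
    benign = [[(W + 0.01 * rng.normal(size=W.shape), b.copy()) for W, b in base]
              for _ in range(n_benign)]
    faulty = [(-W, b.copy()) for W, b in base]  # assumed fault: sign-flipped weights
    return benign + [faulty]  # the faulty client is last (index n_benign)


clients = make_clients()
suspect, votes = suspect_client(clients, input_dim=8)
print("suspected client:", suspect, "votes:", votes)
```

The majority rule assumes that most clients behave normally in a given round. If faulty clients could form a majority, this simple vote would no longer point at them.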