Impact of Redundancy on Resilience in Distributed Optimization and Learning

From arXiv: 49 pages, 2 figures, 2 tables. Updated with the full version of the paper, updated results in Section 4 and Appendix C, and other minor fixes. arXiv admin note: substantial text overlap with arXiv:2110.10858

This report considers the problem of resilient distributed optimization and stochastic learning in a server-based architecture. The system comprises a server and multiple agents, where each agent has its own local cost function. The agents collaborate with the server to find a minimum of the aggregate of the local cost functions. In the context of stochastic learning, the local cost of an agent is the loss function computed over the data at that agent. We consider this problem in a system where some of the agents may be Byzantine faulty and some may be slow (also called stragglers). In this setting, we investigate the conditions under which it is possible to obtain an "approximate" solution to the above problem. In particular, we introduce the notion of $(f, r; \epsilon)$-resilience to characterize how well the true solution is approximated in the presence of up to $f$ Byzantine faulty agents and up to $r$ stragglers; a smaller $\epsilon$ represents a better approximation. We also introduce a measure named $(f, r; \epsilon)$-redundancy to characterize the redundancy in the cost functions of the agents. Greater redundancy allows for a better approximation when solving the problem of aggregate cost minimization. We constructively show, both theoretically and empirically, that $(f, r; \mathcal{O}(\epsilon))$-resilience can indeed be achieved in practice, provided that the local cost functions are sufficiently redundant.
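To make the setting concrete, the following is a minimal sketch of server-based gradient descent with up to $f$ Byzantine agents and up to $r$ stragglers. It is an illustration under stated assumptions, not the algorithm analyzed in the report: the quadratic local costs, the coordinate-wise trimmed mean used as the aggregation rule, and the names `trimmed_mean`, `run_server`, and `CENTERS` are all hypothetical choices made for this example. Redundancy is mimicked loosely by giving the honest agents nearly identical minimizers.

```python
import random
from typing import List, Set

Vector = List[float]


def trimmed_mean(grads: List[Vector], f: int) -> Vector:
    """Coordinate-wise trimmed mean: per coordinate, drop the f smallest
    and f largest values, then average what is left.
    (Assumed aggregation rule, chosen only for illustration.)"""
    if len(grads) <= 2 * f:
        raise ValueError("need more than 2f gradients to trim f from each side")
    dim = len(grads[0])
    out = []
    for k in range(dim):
        column = sorted(g[k] for g in grads)
        kept = column[f:len(column) - f]
        out.append(sum(kept) / len(kept))
    return out


def run_server(centers: List[Vector], byzantine: Set[int], f: int, r: int,
               steps: int = 200, lr: float = 0.1, seed: int = 0) -> Vector:
    """Gradient descent at the server. Agent i is assumed to hold the
    local cost Q_i(x) = 0.5 * ||x - centers[i]||^2, so its honest
    gradient at x is x - centers[i]."""
    rng = random.Random(seed)
    n = len(centers)
    dim = len(centers[0])
    x = [0.0] * dim
    for _ in range(steps):
        # Stragglers: the server proceeds with only n - r responses per round.
        responders = rng.sample(range(n), n - r)
        grads = []
        for i in responders:
            if i in byzantine:
                # A Byzantine agent may send an arbitrary vector.
                grads.append([rng.uniform(-100.0, 100.0) for _ in range(dim)])
            else:
                grads.append([x[k] - centers[i][k] for k in range(dim)])
        agg = trimmed_mean(grads, f)
        x = [x[k] - lr * agg[k] for k in range(dim)]
    return x


# Six honest agents with nearly identical minimizers, one faulty agent (index 6).
CENTERS = [[1.00, -2.00], [1.02, -1.98], [0.98, -2.01],
           [1.01, -2.02], [0.99, -1.99], [1.00, -2.00], [0.0, 0.0]]

if __name__ == "__main__":
    print(run_server(CENTERS, byzantine={6}, f=1, r=1))
```

In this toy instance the estimate settles inside the small box spanned by the honest agents' minimizers, which loosely mirrors the claim that greater redundancy permits a smaller approximation error. Spreading the honest entries of `CENTERS` further apart widens that box accordingly.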
