Deploying Convolutional Neural Networks (CNNs) on resource-constrained devices necessitates efficient management of computational resources, often via distributed systems that are susceptible to latency from straggler nodes. This paper introduces the Flexible Coded Distributed Convolution Computing (FCDCC) framework to enhance fault tolerance and numerical stability in distributed CNNs. We extend Coded Distributed Computing (CDC) with Circulant and Rotation Matrix Embedding (CRME), originally proposed for matrix multiplication, to high-dimensional tensor convolution. Within the proposed Numerically Stable Coded Tensor Convolution (NSCTC) scheme, we further introduce two new coded partitioning schemes: Adaptive-Padding Coded Partitioning (APCP) for the input tensor and Kernel-Channel Coded Partitioning (KCCP) for the filter tensor. These strategies enable linear decomposition of tensor convolutions and their encoding into CDC sub-tasks, combining model parallelism with coded redundancy for robust and efficient execution. Theoretical analysis identifies an optimal trade-off between communication and storage costs. Empirical results validate the framework's effectiveness in terms of computational efficiency, fault tolerance, and scalability across various CNN architectures.
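The core idea the abstract describes, encoding linearly decomposed convolution sub-tasks so that a master can recover the full result from any sufficiently large subset of workers, can be sketched in miniature. The example below is a hedged illustration, not the paper's method: it codes K filter partitions into N coded filters with a Vandermonde generator (a stand-in for the CRME encoding) and recovers all K partial convolutions from any K surviving workers, exploiting the fact that convolution is linear in the filter. All names and the 1-channel setup are illustrative assumptions.

```python
import numpy as np

def conv2d(x, k):
    """Valid-mode 2-D cross-correlation (naive loop for clarity)."""
    H, W = x.shape
    h, w = k.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + h, j:j + w] * k)
    return out

rng = np.random.default_rng(0)
K, N = 3, 5                                # K data partitions, N workers
x = rng.standard_normal((8, 8))            # input tensor (single channel)
filters = rng.standard_normal((K, 3, 3))   # K filter partitions

# Encode: each worker receives a linear combination of the K filters.
# A Vandermonde matrix over distinct points guarantees any K rows
# are invertible (MDS property); the paper's CRME encoding plays
# this role with better numerical stability.
G = np.vander(np.linspace(0.1, 1.0, N), K, increasing=True)  # N x K
coded = np.einsum('nk,kij->nij', G, filters)

# Workers compute their coded sub-convolutions; suppose workers
# 1 and 4 straggle, leaving exactly K = 3 responsive workers.
alive = [0, 2, 3]
worker_out = {n: conv2d(x, coded[n]) for n in alive}

# Decode: invert the surviving K x K sub-generator and recover
# the K uncoded partial results.
G_inv = np.linalg.inv(G[alive])
Y = np.stack([worker_out[n] for n in alive])
recovered = np.einsum('kn,nij->kij', G_inv, Y)

# Sanity check against the direct (uncoded) computation.
direct = np.stack([conv2d(x, f) for f in filters])
assert np.allclose(recovered, direct)
```

Because the decode step is a dense K x K inversion, the conditioning of the generator matrix determines numerical stability, which is precisely the concern that motivates the CRME-based construction over plain Vandermonde codes.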