The emergence of large-scale foundation models (FoMos) capable of human-like intelligence motivates their deployment at the network edge, where devices can access state-of-the-art artificial intelligence. For better user experiences, pre-trained FoMos need to be adapted to specialized downstream tasks through fine-tuning. To transcend a single device's memory and computation limitations, we advocate multi-device cooperation within the device-edge cooperative fine-tuning (DEFT) paradigm, in which edge devices cooperate to simultaneously optimize different blocks of fine-tuning parameters within a FoMo. However, these parameter blocks reside at different depths of the FoMo architecture, so gradient backpropagation incurs depth-dependent computation latency and memory costs. The heterogeneous on-device computation and memory capacities, together with heterogeneous channel conditions, necessitate an integrated communication-and-computation allocation of local computation loads and communication resources to achieve low-latency (LoLa) DEFT. To this end, we first consider the depth-aware DEFT block-allocation problem. The underlying optimal block-device matching is solved by the proposed low-complexity Cutting-RecoUNting-CHecking (CRUNCH) algorithm, which exploits the monotonically increasing relationship between block depth and computation latency and memory cost. Next, joint bandwidth-and-block allocation makes the problem more sophisticated: by transforming and analyzing the original problem with auxiliary variables indicating device involvement, we obtain a splittable Lagrangian expression, which the dual ascent method then solves iteratively. Extensive experiments on the GLUE benchmark demonstrate the significant latency reduction achieved by LoLa DEFT when fine-tuning a RoBERTa model.
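To illustrate the dual-ascent idea on a splittable Lagrangian, the following is a minimal sketch on a deliberately simplified toy model, not the paper's actual formulation: each device `i` has a hypothetical communication load `c_i` and transmission latency `c_i / b_i` under allocated bandwidth `b_i`, and we minimize the total latency subject to a bandwidth budget `B`. The Lagrangian then splits per device, each primal subproblem has the closed-form minimizer `b_i = sqrt(c_i / lam)`, and the multiplier `lam` is updated by gradient ascent on the constraint residual.

```python
import math

def dual_ascent_bandwidth(costs, B, lr=0.05, iters=5000):
    """Toy dual ascent: minimize sum_i c_i / b_i subject to sum_i b_i <= B.

    The Lagrangian sum_i (c_i/b_i + lam*b_i) - lam*B splits across devices;
    each per-device subproblem is minimized in closed form by
    b_i = sqrt(c_i / lam), and lam ascends along the constraint residual.
    """
    lam = 0.5  # initial multiplier (any positive value works here)
    for _ in range(iters):
        # Primal step: per-device closed-form minimizer of c_i/b_i + lam*b_i.
        b = [math.sqrt(c / lam) for c in costs]
        # Dual step: projected gradient ascent on the bandwidth constraint.
        lam = max(1e-9, lam + lr * (sum(b) - B))
    return b

costs = [4.0, 1.0, 9.0]   # hypothetical per-device communication loads
B = 6.0                   # total bandwidth budget
b = dual_ascent_bandwidth(costs, B)
```

For this separable toy objective the optimum is also available in closed form, `b_i = B * sqrt(c_i) / sum_j sqrt(c_j)`, i.e. `[2, 1, 3]` here, which the iteration converges to; the paper's actual problem additionally couples block-to-device matching with the bandwidth split.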