The softmax-contaminated mixture of experts (MoE) model is deployed when a large-scale pre-trained model, which plays the role of a fixed expert, is fine-tuned for downstream tasks by including a new contamination part, or prompt, that functions as a new, trainable expert. Despite its popularity and relevance, the theoretical properties of the softmax-contaminated MoE have remained unexplored in the literature. In this paper, we study the convergence rates of the maximum likelihood estimator of the gating and prompt parameters in order to gain insights into the statistical properties and potential challenges of fine-tuning with a new prompt. We find that the estimability of these parameters is compromised when the prompt acquires knowledge that overlaps with that of the pre-trained model, in a sense that we make precise by formulating a novel analytic notion of distinguishability. When the pre-trained and prompt models are distinguishable, we derive minimax optimal estimation rates for all the gating and prompt parameters. By contrast, when the distinguishability condition is violated, these estimation rates become significantly slower because they depend on the rate at which the prompt converges to the pre-trained model. Finally, we empirically corroborate our theoretical findings through several numerical experiments.
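To fix ideas, a minimal schematic of such a model can be written as below; the notation ($g_0$ for the fixed pre-trained expert density, $f(\cdot \mid \cdot, \eta)$ for the trainable prompt expert, and $(\beta_0, \beta_1)$ for the gating parameters) is an illustrative assumption rather than the paper's exact formulation:

\[
  p_{\beta,\eta}(Y \mid X)
  \;=\; \frac{\exp(\beta_0 + \beta_1^{\top} X)}{1 + \exp(\beta_0 + \beta_1^{\top} X)}\, f(Y \mid X, \eta)
  \;+\; \frac{1}{1 + \exp(\beta_0 + \beta_1^{\top} X)}\, g_0(Y \mid X).
\]

Read this way, distinguishability concerns how well the prompt expert $f(\cdot \mid \cdot, \eta)$ can be told apart from the fixed expert $g_0$; a prompt that drifts toward $g_0$ is precisely the regime in which the estimation rates described above deteriorate.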