Multi-Head Adapter Routing for Cross-Task Generalization

Parameter-efficient fine-tuning (PEFT) for cross-task generalization consists of pre-training adapters on a multi-task training set before few-shot adaptation to test tasks. Polytropon [Ponti et al., 2023] ($\texttt{Poly}$) jointly learns an inventory of adapters and a routing function that selects a (variable-size) subset of adapters for each task during both pre-training and few-shot adaptation. In this paper, we investigate the role that adapter routing plays in its success and design new variants based on our findings. First, we build on the intuition that finer-grained routing provides more expressivity. Hence, we propose $\texttt{MHR}$ (Multi-Head Routing), which combines $\textit{subsets}$ of adapter parameters and outperforms $\texttt{Poly}$ under a comparable parameter budget. By fine-tuning only the routing function and not the adapters ($\texttt{MHR}$-$z$), we achieve competitive performance with extreme parameter efficiency. Second, we find that $\texttt{Poly}$/$\texttt{MHR}$ performance is a result of better multi-task optimization, rather than of modular inductive biases that facilitate adapter recombination and local adaptation, as previously hypothesized. In fact, we find that $\texttt{MHR}$ exhibits higher gradient alignment between tasks than any other method. Since this implies that routing is crucial only during multi-task pre-training, we propose $\texttt{MHR}$-$\mu$, which discards routing and fine-tunes the average of the pre-trained adapters during few-shot adaptation. This establishes $\texttt{MHR}$-$\mu$ as an effective method for single-adapter fine-tuning.
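
The difference between per-adapter routing, per-subset routing, and plain averaging can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes each adapter is flattened into a single parameter vector, that the "subsets" are equal-sized contiguous blocks of that vector, and that routing logits are normalized with a softmax. The function names `poly_combine`, `mhr_combine`, and `mhr_mu_init` are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def poly_combine(adapters, logits):
    """Poly-style routing: one mixing weight per adapter for a task.

    adapters: (n_adapters, dim) flattened adapter parameters (assumed layout)
    logits:   (n_adapters,) task-specific routing logits
    returns:  (dim,) task-specific adapter
    """
    weights = softmax(logits)  # softmax normalization is an assumption
    return weights @ adapters


def mhr_combine(adapters, head_logits):
    """MHR-style routing: each head mixes its own block of parameters.

    adapters:    (n_adapters, dim)
    head_logits: (n_heads, n_adapters), one routing row per head
    returns:     (dim,) task-specific adapter
    """
    n_adapters, dim = adapters.shape
    n_heads = head_logits.shape[0]
    if dim % n_heads != 0:
        raise ValueError("dim must be divisible by n_heads")
    # Split every adapter into n_heads contiguous blocks (assumed partition).
    blocks = adapters.reshape(n_adapters, n_heads, dim // n_heads)
    weights = softmax(head_logits, axis=-1)  # (n_heads, n_adapters)
    # For each head h, take a weighted sum over adapters m of block (m, h).
    mixed = np.einsum("hm,mhb->hb", weights, blocks)
    return mixed.reshape(dim)


def mhr_mu_init(adapters):
    """MHR-mu-style starting point: drop routing, average the adapters."""
    return adapters.mean(axis=0)


rng = np.random.default_rng(0)
adapters = rng.normal(size=(4, 8))     # 4 adapters, 8 parameters each
head_logits = rng.normal(size=(2, 4))  # 2 heads, each routing over 4 adapters
task_adapter = mhr_combine(adapters, head_logits)
single_adapter = mhr_mu_init(adapters)
```

In this sketch, an $\texttt{MHR}$-$z$-style adaptation would update only `head_logits` while keeping `adapters` frozen, whereas an $\texttt{MHR}$-$\mu$-style adaptation would fine-tune `single_adapter` directly. With a single head, `mhr_combine` reduces to `poly_combine`.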
