Multitask learning is widely used in practice to train a low-resource target task by augmenting it with multiple related source tasks. Yet naively combining all the source tasks with a target task does not always improve prediction performance on the target task, because of negative transfers. Thus, a critical problem in multitask learning is identifying subsets of source tasks that would benefit the target task. This problem is computationally challenging because the number of subsets grows exponentially with the number of source tasks, and efficient heuristics for subset selection do not always capture the relationship between task subsets and multitask learning performance. In this paper, we introduce an efficient procedure to address this problem via surrogate modeling. In surrogate modeling, we sample random subsets of source tasks and precompute their multitask learning performances; then, we approximate the precomputed performances with a linear regression model that can also be used to predict the multitask performance of unseen task subsets. We show theoretically and empirically that fitting this model requires sampling only a number of subsets that is linear in the number of source tasks. The fitted model provides a relevance score between each source task and the target task; we use these relevance scores to perform subset selection for multitask learning by thresholding. Through extensive experiments, we show that our approach predicts negative transfers from multiple source tasks to target tasks much more accurately than existing task affinity measures. Additionally, we demonstrate that on five weak supervision datasets, our approach consistently improves upon existing optimization methods for multitask learning.
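The surrogate modeling procedure described above can be illustrated with a minimal sketch. It is not the paper's implementation, and it rests on several assumptions that are not stated in the abstract: each source task is included in a sampled subset independently with probability 0.5, the regression includes an intercept, higher performance is better, and the threshold is 0. The function `toy_performance` is a hypothetical stand-in for actually training a multitask model on a subset and evaluating it on the target task; its additive form and the values in `true_effects` are invented purely for illustration.

```python
import numpy as np


def sample_subsets(num_tasks, num_subsets, rng):
    """Return a 0/1 indicator matrix with one random subset of source tasks per row.

    Assumption: each task is included independently with probability 0.5.
    """
    return (rng.random((num_subsets, num_tasks)) < 0.5).astype(float)


def toy_performance(mask, true_effects, rng, noise=0.005):
    """Hypothetical stand-in for training on a subset and scoring the target task.

    A real pipeline would train a multitask model here; this toy version adds up
    per-task effects plus a small amount of noise.
    """
    return 0.7 + mask @ true_effects + noise * rng.standard_normal()


def fit_relevance_scores(X, y):
    """Least-squares fit of y ~ X @ theta + b; theta holds one score per source task."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])  # append an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]


def select_tasks(theta, threshold=0.0):
    """Keep the source tasks whose relevance score exceeds the threshold."""
    return np.flatnonzero(theta > threshold)


rng = np.random.default_rng(0)

# Illustrative ground truth: positive entries help the target task, negative ones hurt it.
true_effects = np.array([0.05, -0.04, 0.03, -0.06, 0.02, -0.03, 0.04, -0.05])
num_tasks = len(true_effects)

# Sample a number of subsets that is linear in the number of source tasks.
X = sample_subsets(num_tasks, num_subsets=5 * num_tasks, rng=rng)
y = np.array([toy_performance(mask, true_effects, rng) for mask in X])

theta, intercept = fit_relevance_scores(X, y)
selected = select_tasks(theta)

print("relevance scores:", np.round(theta, 3))
print("selected source tasks:", selected.tolist())

# The fitted surrogate can also score a subset that was never sampled.
unseen = np.zeros(num_tasks)
unseen[selected] = 1.0
print("predicted performance of selected subset:", float(unseen @ theta + intercept))
```

In this toy setting the relevance scores recover the sign of each task's effect, so thresholding at 0 drops the tasks that would cause negative transfer. With a real training pipeline the relationship between subsets and performance is not guaranteed to be additive, and the threshold would need to be tuned.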