Multi-task learning (MTL) models have demonstrated impressive results in computer vision, natural language processing, and recommender systems. Although many approaches have been proposed, how well they balance different tasks on each parameter remains unclear. In this paper, we propose to measure the task dominance degree of a parameter by the total updates that each task makes to this parameter. Specifically, we compute the total updates as the exponentially decaying Average of the squared Updates (AU) on a parameter from the corresponding task. Based on this novel metric, we observe that many parameters in existing MTL methods, especially those in the higher shared layers, are still dominated by one or several tasks. The dominance of AU is mainly due to the dominance of the accumulative gradients from one or several tasks. Motivated by this, we propose a Task-wise Adaptive learning rate approach, AdaTask for short, which separates the accumulative gradients, and hence the learning rate, of each task for each parameter in adaptive learning rate approaches (e.g., AdaGrad, RMSProp, and Adam). Comprehensive experiments on computer vision and recommender system MTL datasets demonstrate that AdaTask significantly improves the performance of dominated tasks, resulting in state-of-the-art (SOTA) average task-wise performance. Analysis on both synthetic and real-world datasets shows that AdaTask balances the parameters in every shared layer well.
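To make the AU metric and the task-wise separation concrete, the following is a minimal sketch on a single scalar parameter shared by two tasks, using an RMSProp-style update. It is an illustration under stated assumptions, not the authors' implementation: the function names (`rmsprop_shared_step`, `rmsprop_adatask_step`, `update_au`, `run`), the decay rates, the learning rate, and the constant per-task gradients of 10.0 and 0.1 are all hypothetical choices made for the example.

```python
import math


def rmsprop_shared_step(theta, task_grads, state, lr=0.01, gamma=0.9, eps=1e-8):
    """RMSProp-style step with ONE accumulator fed by the summed task gradients."""
    g = sum(task_grads)
    state["G"] = gamma * state["G"] + (1 - gamma) * g * g
    scale = lr / (math.sqrt(state["G"]) + eps)
    # Every task is scaled by the same factor, so a large-gradient task
    # also makes the largest update to the parameter.
    updates = [-scale * gk for gk in task_grads]
    return theta + sum(updates), updates


def rmsprop_adatask_step(theta, task_grads, state, lr=0.01, gamma=0.9, eps=1e-8):
    """Task-wise variant (assumed form): one accumulator per task."""
    updates = []
    for k, gk in enumerate(task_grads):
        state["G"][k] = gamma * state["G"][k] + (1 - gamma) * gk * gk
        # Each task's gradient is normalized by its own accumulated gradient.
        updates.append(-lr * gk / (math.sqrt(state["G"][k]) + eps))
    return theta + sum(updates), updates


def update_au(au, updates, beta=0.9):
    """Exponentially decaying average of each task's squared update (AU)."""
    return [beta * a + (1 - beta) * u * u for a, u in zip(au, updates)]


def run(step_fn, state, grads=(10.0, 0.1), steps=50):
    """Apply step_fn repeatedly with constant (hypothetical) task gradients."""
    theta, au = 0.0, [0.0] * len(grads)
    for _ in range(steps):
        theta, updates = step_fn(theta, list(grads), state)
        au = update_au(au, updates)
    return au


au_shared = run(rmsprop_shared_step, {"G": 0.0})
au_adatask = run(rmsprop_adatask_step, {"G": [0.0, 0.0]})

# Ratio of task 0's AU to task 1's AU: far above 1 means task 0 dominates.
print("shared accumulator :", au_shared[0] / au_shared[1])
print("task-wise (AdaTask):", au_adatask[0] / au_adatask[1])
```

In this toy setting the shared accumulator leaves the AU of the large-gradient task several orders of magnitude above that of the other task, while the per-task accumulators bring the two AU values close to parity. Real MTL gradients vary over time and across parameters, so the sketch shows only the mechanism, not the magnitude of the effect reported in the experiments.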