Transformer-based models, despite achieving super-human performance on several downstream tasks, are often regarded as black boxes and used as a whole. It remains unclear what mechanisms they have learned, especially in their core module, multi-head attention. Inspired by functional specialization in the human brain, which helps it handle multiple tasks efficiently, this work investigates whether the multi-head attention module evolves a similar functional separation under multi-task training. If so, can this mechanism further improve model performance? To investigate these questions, we introduce an interpretation method that quantifies the degree of functional specialization in multi-head attention. We further propose a simple multi-task training method that increases functional specialization and mitigates negative information transfer in multi-task learning. Experimental results on seven pre-trained transformer models demonstrate that multi-head attention does evolve functional specialization after multi-task training, and that this specialization is affected by task similarity. Moreover, the multi-task training strategy based on functional specialization boosts performance in both multi-task learning and transfer learning without adding any parameters.