Sequence-to-sequence (seq2seq) models have been widely used for natural language processing, computer vision, and other deep learning tasks. We find that seq2seq models trained with early stopping suffer from fitting issues at the token level. In particular, when training is stopped, some tokens in the vocabulary are overfitted while others are still underfitted. Experiments show that this phenomenon is pervasive across different models, including fine-tuned large pretrained models. We identify three major factors that influence token-level fitting: token frequency, part of speech, and prediction discrepancy. Further, we find that external factors such as language, model size, domain, data scale, and pretraining can also influence the fitting of tokens.
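As an illustration of the token-level view described in the abstract, the following is a minimal sketch and not the authors' measurement. It assumes that a validation loss has been recorded at every checkpoint, both averaged over all tokens and separately for each vocabulary token. A token is then labelled overfit if its own loss bottomed out before the global early-stopping checkpoint, and underfit if its loss was still improving after that checkpoint. The function names, the labelling rule, and the toy loss curves are all hypothetical.

```python
from typing import Dict, List


def early_stopping_step(mean_val_loss: List[float]) -> int:
    """Return the checkpoint index with the lowest mean validation loss."""
    return min(range(len(mean_val_loss)), key=mean_val_loss.__getitem__)


def classify_tokens(per_token_loss: Dict[str, List[float]], stop: int) -> Dict[str, str]:
    """Label each token by comparing its own best checkpoint with `stop`.

    Assumed rule (hypothetical, for illustration only):
      best checkpoint before `stop` -> "overfit"  (loss has already risen again)
      best checkpoint after `stop`  -> "underfit" (loss was still falling)
      best checkpoint equal to `stop` -> "fit"
    """
    labels = {}
    for token, curve in per_token_loss.items():
        best = min(range(len(curve)), key=curve.__getitem__)
        if best < stop:
            labels[token] = "overfit"
        elif best > stop:
            labels[token] = "underfit"
        else:
            labels[token] = "fit"
    return labels


# Toy validation-loss curves over five checkpoints (made-up numbers).
mean_loss = [2.0, 1.5, 1.2, 1.3, 1.4]
per_token = {
    "the": [1.0, 0.6, 0.7, 0.8, 0.9],    # frequent token, bottoms out early
    "quark": [4.0, 3.5, 3.0, 2.6, 2.4],  # rare token, still improving at the end
    "run": [2.0, 1.6, 1.3, 1.4, 1.5],    # tracks the mean curve
}

stop = early_stopping_step(mean_loss)
labels = classify_tokens(per_token, stop)
print(stop, labels)
```

In this toy setup a single global stopping point is chosen from the averaged curve, which is why individual tokens can end up on either side of their own optimum.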