Sequence-to-sequence (seq2seq) models are widely used in natural language processing, computer vision, and other deep learning tasks. We find that seq2seq models trained with early stopping suffer from fitting issues at the token level. In particular, when training is stopped, some tokens in the vocabulary are overfit while others are still underfit. Experiments show that this phenomenon is pervasive across different models, even in fine-tuned large pretrained models. We identify three major factors that influence token-level fitting: token frequency, part of speech, and prediction discrepancy. Further, we find that external factors such as language, model size, domain, data scale, and pretraining can also influence the fitting of tokens.