Unsupervised learning objectives such as autoregressive and masked language modeling play a significant role in producing pre-trained representations that support various downstream applications, from natural language understanding to conversational tasks. However, despite the impressive generative capabilities of recent large language models, their ability to capture syntactic or semantic structure within text lags behind. We hypothesize that this mismatch between linguistic performance and competence in machines is attributable to insufficient learning of linguistic structure via currently popular pre-training objectives. Working with English, we show that punctuation restoration as a learning objective improves performance on structure-related tasks such as named entity recognition, open information extraction, chunking, and part-of-speech tagging. Punctuation restoration yields improvements of $\geq 2$ percentage points in 16 out of 18 experiments, spanning 6 out of 7 tasks. Our results show that punctuation restoration is an effective learning objective that can improve structure understanding and yield more robust, structure-aware representations of natural language in base-sized models.
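To make the objective concrete, here is a minimal sketch, not the authors' implementation, of one common way punctuation restoration can be framed as token-level classification: punctuation is stripped from raw text, and the model is trained to predict, for each remaining token, which mark originally followed it. The label inventory `PUNCT_LABELS` and the helper `make_example` are illustrative assumptions; the paper's exact setup may differ.

```python
# A hypothetical sketch of data preparation for a punctuation-restoration
# objective framed as token classification. Assumption: each token is
# labeled with the punctuation mark (if any) that immediately follows it.

import re

# Illustrative label set (an assumption, not taken from the paper).
PUNCT_LABELS = {"": 0, ",": 1, ".": 2, "?": 3, "!": 4, ";": 5, ":": 6}

def make_example(text: str):
    """Turn raw text into (tokens, labels) for punctuation restoration.

    tokens: the text with trailing punctuation removed (the model input).
    labels: for each token, the id of the punctuation mark that
            originally followed it ("" means no punctuation).
    """
    tokens, labels = [], []
    for piece in text.split():
        # Split each whitespace token into (word, optional trailing mark).
        m = re.match(r"^(.*?)([,.?!;:]?)$", piece)
        word, punct = m.group(1), m.group(2)
        if word:  # skip pieces that were punctuation only
            tokens.append(word)
            labels.append(PUNCT_LABELS[punct])
    return tokens, labels

tokens, labels = make_example("Working with English, we show improvements.")
print(tokens)  # ['Working', 'with', 'English', 'we', 'show', 'improvements']
print(labels)  # [0, 0, 1, 0, 0, 2]
```

Under this framing, a standard encoder with a per-token classification head can be trained with cross-entropy over the label ids, analogously to part-of-speech tagging or named entity recognition.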