Unsupervised learning objectives such as language modeling and denoising play a significant part in producing pre-trained models that serve a range of downstream applications, from natural language understanding to conversational tasks. However, despite the impressive generative capabilities of recent large language models, their ability to capture syntactic or semantic structure within text lags behind. We hypothesize that this mismatch between linguistic performance and competence in machines is attributable to insufficient transfer of linguistic structure knowledge to computational systems under currently popular pre-training objectives. We show that punctuation restoration as a learning objective improves in- and out-of-distribution performance on structure-related tasks such as named entity recognition, open information extraction, chunking, and part-of-speech tagging. Punctuation restoration is thus an effective learning objective that can improve structure understanding and yield more robust, structure-aware representations of natural language.
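
As an illustration of the punctuation restoration objective, the following minimal sketch builds one training pair: the input is a sentence with punctuation removed, and the target is the original sentence. The function name `make_restoration_pair` and the choice of `string.punctuation` as the set of marks to strip are assumptions made for this example; the text above does not specify the exact preprocessing.

```python
import string
from typing import Tuple

# Assumption: strip every ASCII punctuation mark. The actual set of marks
# used for the objective is not given in the text above.
PUNCTUATION = set(string.punctuation)


def make_restoration_pair(text: str) -> Tuple[str, str]:
    """Return (source, target) for a punctuation restoration example.

    source: the text with punctuation removed and whitespace normalized.
    target: the original, fully punctuated text.
    """
    stripped = "".join(ch for ch in text if ch not in PUNCTUATION)
    # Collapse any extra whitespace left behind by removed marks.
    source = " ".join(stripped.split())
    return source, text


source, target = make_restoration_pair(
    "However, the model failed; we tried again."
)
print(source)
print(target)
```

A sequence-to-sequence model trained on such pairs has to recover clause and phrase boundaries from unpunctuated input, which is the kind of structural signal the objective is meant to provide.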