Unsupervised learning objectives like language modeling and de-noising play a significant part in producing pre-trained models that perform various downstream applications, from natural language understanding to conversational tasks. However, despite the impressive conversational capabilities of recent large language models, their ability to capture syntactic or semantic structure within text lags behind. We hypothesize that the mismatch between linguistic performance and competence in machines is attributable to the insufficient transfer of linguistic structure knowledge to computational systems by currently popular pre-training objectives. We show that punctuation restoration as a learning objective improves in- and out-of-distribution performance on structure-related tasks such as named entity recognition, open information extraction, chunking, and part-of-speech tagging. Punctuation restoration is an effective learning objective that can improve structure understanding and yield more robust, structure-aware representations of natural language.
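To make the objective concrete, the following is a minimal sketch of how a single punctuation restoration training pair could be built from raw text. The function name `make_restoration_pair`, the choice of ASCII punctuation marks, and the optional lowercasing are illustrative assumptions, not details stated in this abstract.

```python
import string
from typing import Tuple

# Assumption: strip all ASCII punctuation; the abstract does not list the marks used.
PUNCTUATION = set(string.punctuation)


def make_restoration_pair(text: str, lowercase: bool = True) -> Tuple[str, str]:
    """Return (corrupted_input, target) for one hypothetical training example.

    The model would read the corrupted input and be trained to emit the target.
    """
    # Drop every punctuation character from the original text.
    stripped = "".join(ch for ch in text if ch not in PUNCTUATION)
    # Collapse any whitespace runs left behind by the removed marks.
    corrupted = " ".join(stripped.split())
    # Assumption: casing is also removed so it cannot leak sentence boundaries.
    if lowercase:
        corrupted = corrupted.lower()
    return corrupted, text


source, target = make_restoration_pair("However, the model lags behind.")
print(source)
print(target)
```

Because the target is simply the original text, such pairs can be generated from any unlabeled corpus, which is what makes the objective unsupervised.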