Syntactic parsing is a core task in natural-language processing, and constituent structure is one widely used description of syntax. Traditional notions of constituency require that constituents consist of adjacent words, which poses challenges for analysing non-local dependencies, common in languages such as German. A number of treebanks, such as NeGra and TIGER for German and DPTB for English, therefore represent long-range dependencies with crossing edges. Various grammar formalisms have been used to describe such discontinuous trees, often at the cost of high parsing time complexity. Transition-based parsing reduces this cost by eliminating the need for an explicit grammar: instead, neural networks are trained with supervised learning on large annotated corpora to produce trees from raw text input. An elegant stack-free transition-based parser proposed by Coavoux and Cohen (2019) derives any discontinuous constituent tree over a sentence in worst-case quadratic time. The purpose of this work is to explore the introduction of supertag information into transition-based discontinuous constituent parsing. In lexicalised grammar formalisms such as CCG (Steedman, 1989), informative categories are assigned to the words of a sentence and act as the building blocks for composing its syntax. These supertags indicate a word's structural role and its syntactic relationship to surrounding items. The study examines two ways of incorporating supertag information: using a dedicated supertagger as additional input to a neural parser (pipeline), and jointly training a single neural model for both parsing and supertagging (multi-task). Besides CCG, several other frameworks (LTAG-spinal, LCFRS) and sequence-labelling tasks (chunking, dependency parsing) are compared with respect to their suitability as auxiliary tasks for parsing.
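To illustrate the idea of supertags as lexical building blocks, the following minimal sketch assigns CCG-style categories to the words of a short sentence and combines them with forward application. The category inventory, the tuple encoding, and the `forward_apply` helper are illustrative assumptions for this sketch, not the parser or supertagger described in this work.

```python
# Hypothetical CCG supertags for "John saw Mary".
# A functor category like (S\NP)/NP is encoded as a tuple
# (result, slash, argument); atomic categories are plain strings.
supertags = {
    "John": "NP",
    "saw":  ("S\\NP", "/", "NP"),  # transitive verb: takes NP on the right, then NP on the left
    "Mary": "NP",
}

def forward_apply(functor, argument):
    """Forward application X/Y + Y => X; returns None if the rule does not apply."""
    if isinstance(functor, tuple) and functor[1] == "/" and functor[2] == argument:
        return functor[0]
    return None

# "saw" first consumes the object "Mary", yielding the verb-phrase
# category S\NP, which would then seek the subject NP on its left.
vp = forward_apply(supertags["saw"], supertags["Mary"])
print(vp)  # S\NP
```

The supertag of "saw" alone already encodes the verb's subcategorisation frame, which is exactly the kind of information the pipeline and multi-task setups feed into the parser.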