Sentences with complex syntax are a major stumbling block for downstream Natural Language Processing applications, whose predictive quality deteriorates with sentence length and complexity. Text Simplification (TS) can remedy this situation: it modifies sentences to make them easier to process, using rewriting operations such as reordering, deletion, and splitting. State-of-the-art syntactic TS approaches suffer from two major drawbacks. First, they are very conservative, tending to retain the input rather than transform it. Second, they ignore the cohesive nature of texts, where context spread across clauses or sentences is needed to infer the true meaning of a statement. To address these problems, we present a discourse-aware TS approach that splits and rephrases complex English sentences within the semantic context in which they occur. In a linguistically grounded transformation stage that uses clausal and phrasal disembedding mechanisms, complex sentences are transformed into shorter utterances with a simple canonical structure that downstream applications can easily analyze. With sentence splitting, we thus address a TS task that has hardly been explored so far. Moreover, we introduce the notion of minimality in this context, as we aim to decompose source sentences into a set of self-contained minimal semantic units. To avoid breaking the input down into a disjointed sequence of statements that is difficult to interpret because important contextual information is missing, we incorporate the semantic context between the split propositions in the form of hierarchical structures and semantic relationships. In this way, we generate a semantic hierarchy of minimal propositions, a novel representation of complex assertions that puts a semantic layer on top of the simplified sentences.
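To make the idea of a semantic hierarchy of minimal propositions more concrete, the following is a minimal sketch of how such an output could be represented as a data structure. It is not the authors' implementation: the `Proposition` class, the `attach` and `flatten` helpers, the relation labels, and the example sentence with its hand-written split are all assumptions introduced for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Proposition:
    """One minimal, self-contained statement (hypothetical structure)."""
    text: str
    # Illustrative label for the semantic relationship to the parent node.
    relation: Optional[str] = None
    children: List["Proposition"] = field(default_factory=list)

    def attach(self, text: str, relation: str) -> "Proposition":
        """Add a subordinate proposition linked by a named relation."""
        child = Proposition(text=text, relation=relation)
        self.children.append(child)
        return child


def flatten(node: Proposition, depth: int = 0) -> List[Tuple[int, Optional[str], str]]:
    """Walk the hierarchy depth-first, returning (depth, relation, text) rows."""
    rows = [(depth, node.relation, node.text)]
    for child in node.children:
        rows.extend(flatten(child, depth + 1))
    return rows


# Hand-written split (an assumption, not system output) of the sentence:
# "Although the weather was bad, the team, which had trained for months,
#  finished the race."
root = Proposition("The team finished the race.")
root.attach("The weather was bad.", relation="Contrast")
root.attach("The team had trained for months.", relation="Elaboration")

for depth, relation, text in flatten(root):
    label = f"[{relation}] " if relation else ""
    print("  " * depth + label + text)
```

In this sketch, each node is a short sentence that can be read on its own, while the tree depth and the relation label retain the context that a flat list of split sentences would lose.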