Syntactic Complexity Identification, Measurement, and Reduction Through Controlled Syntactic Simplification

Text simplification is a task in Natural Language Processing (NLP) that rewrites text in a simpler form so that it is easier to understand and explore. Understanding and retrieving knowledge from unstructured text is hard because such text usually consists of compound and complex sentences. State-of-the-art neural methods simplify sentences for readability by replacing words with plain English substitutes and by summarising sentences and paragraphs. In Knowledge Graph (KG) creation from unstructured text, however, summarising long sentences and substituting words is undesirable because it may lose information: KG creation requires extracting all possible facts (triples) with the same mentions as in the text. In this work, we propose a controlled simplification based on the factual information in a sentence, i.e., its triples. We present a classical syntactic dependency-based approach that splits and rephrases a compound or complex sentence into a set of simplified sentences. The simplification retains the original wording while giving each sentence the simple structure of a possible domain fact, i.e., a triple. The paper also introduces an algorithm to identify and measure a sentence's syntactic complexity (SC), which is then reduced through the controlled syntactic simplification process. Finally, we conduct a dataset re-annotation experiment using GPT3 and aim to publish the refined corpus as a resource. This work was accepted and presented at the International Workshop on Learning with Knowledge Graphs (IWLKG) at the WSDM-2023 conference. The code and data are available at www.github.com/sallmanm/SynSim.
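
As a minimal illustration of measuring SC from a dependency parse, the following sketch counts clause-level dependency relations in a hand-annotated sentence. It is not the algorithm introduced in the paper: the Token type, the CLAUSAL_RELATIONS set, the choice of Universal Dependencies labels, and the scoring rule (one point per extra clause or coordinated predicate) are all assumptions made for this example, and the parses are written by hand rather than produced by a parser.

```python
from dataclasses import dataclass

# Assumption of this sketch: these Universal Dependencies relations are
# treated as signals of an extra clause. The paper may use a different set.
CLAUSAL_RELATIONS = {"ccomp", "xcomp", "advcl", "acl", "acl:relcl",
                     "csubj", "parataxis"}
VERBAL_TAGS = {"VERB", "AUX"}


@dataclass(frozen=True)
class Token:
    """One word of a dependency parse (hypothetical type for this sketch)."""
    index: int    # 1-based position in the sentence
    text: str
    upos: str     # coarse part-of-speech tag
    head: int     # index of the head token, 0 for the root
    deprel: str   # dependency relation to the head


def syntactic_complexity(tokens):
    """Return a toy SC score: the number of clause-level relations.

    A coordinated token ("conj") only counts when it is verbal, so that
    noun coordination such as "apples and oranges" is not counted.
    """
    score = 0
    for tok in tokens:
        if tok.deprel in CLAUSAL_RELATIONS:
            score += 1
        elif tok.deprel == "conj" and tok.upos in VERBAL_TAGS:
            score += 1
    return score


def make_sentence(rows):
    """Build Token objects from (text, upos, head, deprel) tuples."""
    return [Token(i, *row) for i, row in enumerate(rows, start=1)]


# "Marie Curie discovered polonium and won two Nobel Prizes."
COMPOUND = make_sentence([
    ("Marie", "PROPN", 3, "nsubj"),
    ("Curie", "PROPN", 1, "flat"),
    ("discovered", "VERB", 0, "root"),
    ("polonium", "NOUN", 3, "obj"),
    ("and", "CCONJ", 6, "cc"),
    ("won", "VERB", 3, "conj"),
    ("two", "NUM", 9, "nummod"),
    ("Nobel", "PROPN", 9, "compound"),
    ("Prizes", "NOUN", 6, "obj"),
])

# "Marie Curie discovered polonium."
SIMPLE = make_sentence([
    ("Marie", "PROPN", 3, "nsubj"),
    ("Curie", "PROPN", 1, "flat"),
    ("discovered", "VERB", 0, "root"),
    ("polonium", "NOUN", 3, "obj"),
])

if __name__ == "__main__":
    print(syntactic_complexity(COMPOUND))
    print(syntactic_complexity(SIMPLE))
```

Under these assumptions, a sentence with a score of zero already has the single-predicate shape of one triple, while a positive score marks a sentence as a candidate for splitting. In the compound example, a controlled simplification would be expected to yield two sentences that keep the original wording, one for "discovered polonium" and one for "won two Nobel Prizes", each sharing the subject "Marie Curie".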
