Natural language often combines multiple ideas into complex sentences. Atomic sentence extraction, the task of decomposing a complex sentence into simpler sentences that each express a single idea, improves performance in information retrieval, question answering, and automated reasoning systems. Previous work has formalized the "split-and-rephrase" task and established evaluation metrics, and machine learning approaches using large language models have improved extraction accuracy. However, these methods lack interpretability and offer limited insight into which linguistic structures cause extraction failures. Although some studies have explored dependency-based extraction of subject-verb-object triples and clauses, no principled analysis has examined which specific clause structures and dependencies make extraction difficult. This study addresses that gap by analyzing how complex sentence structures, including relative clauses, adverbial clauses, coordination patterns, and passive constructions, affect the performance of rule-based atomic sentence extraction. Using the WikiSplit dataset, we implemented dependency-based extraction rules in spaCy, generated 100 gold-standard atomic sentence sets, and evaluated performance using ROUGE and BERTScore. The system achieved ROUGE-1 F1 = 0.6714, ROUGE-2 F1 = 0.478, ROUGE-L F1 = 0.650, and BERTScore F1 = 0.5898, indicating moderate-to-high lexical, structural, and semantic alignment with the gold-standard sets. The most challenging structures were relative clauses, appositions, coordinated predicates, adverbial clauses, and passive constructions. Overall, rule-based extraction is reasonably accurate but sensitive to syntactic complexity.
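To make the reported lexical scores easier to interpret, the following is a minimal sketch of how a ROUGE-1 F1 score can be computed between an extracted set of atomic sentences and a gold-standard set. It is an illustration only, not the evaluation code used in this study: the function names `tokenize` and `rouge1_f1` are hypothetical, the tokenizer is a simple lowercase alphanumeric split, and no stemming is applied, so values may differ from those produced by standard ROUGE packages.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate string and a reference string."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))

    # Clipped overlap: each unigram counts at most as often as it appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical example: system output vs. gold-standard atomic sentences,
    # each set joined into a single string before scoring.
    gold = "The bridge was built in 1932. The bridge spans the river."
    system = "The bridge was built in 1932. It spans the river."
    print(f"ROUGE-1 F1: {rouge1_f1(system, gold):.4f}")
```

Under this definition, a score near 1 means the extracted sentences reuse almost exactly the gold-standard vocabulary, while missing or extra words (for example, an unresolved pronoun in place of a repeated subject) lower recall or precision respectively.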