Long sentences have been a persistent issue in written communication for many years because they make it difficult for readers to grasp the main points or follow the writer's original intent. This survey, conducted following the PRISMA guidelines, systematically reviews two main strategies for addressing long sentences: a) sentence compression and b) sentence splitting. Interest in this area has grown since 2005, with significant growth after 2017. Current research is dominated by supervised approaches for both sentence compression and splitting. Yet there is a considerable gap in weakly supervised and self-supervised techniques, suggesting an opportunity for further research, especially in domains with limited data. In this survey, we group the most representative methods into a comprehensive taxonomy. We also conduct a comparative evaluation of these methods on common sentence compression and splitting datasets. Finally, we discuss the challenges and limitations of current methods, providing insights for future research directions. This survey is meant to serve as a comprehensive resource for addressing the complexities of long sentences. We aim to enable researchers to make further advancements in the field until long sentences are no longer a barrier to effective communication.