Linear text segmentation is the task of automatically tagging text documents with topic shifts, i.e. the places in the text where the topic changes. It is a well-established area of research in Natural Language Processing that draws on well-understood concepts from linguistics and computational linguistics. The field has recently attracted considerable interest because of the surge of text, video, and audio available on the web, which in turn requires ways of summarising and categorising the mass of content, a process for which linear text segmentation is a fundamental step. In this survey, we provide an extensive overview of current advances in linear text segmentation, describing the state of the art in terms of resources and approaches for the task. Finally, we highlight the limitations of available resources and of the task itself, and indicate ways forward based on the most recent literature and under-explored research directions.