Text segmentation is a core task in Natural Language Processing (NLP) and supports several downstream tasks, such as information retrieval and document summarization. In this work, we propose TocBERT, a new solution for segmenting texts using bidirectional transformers. TocBERT is a supervised solution trained to detect titles and subtitles from their semantic representations, with the task formulated as a named entity recognition (NER) problem. We apply the solution to a medical text segmentation use case in which the Bio-ClinicalBERT model is fine-tuned to segment discharge summaries from the MIMIC-III dataset. TocBERT was evaluated on a human-labeled ground-truth corpus of 250 notes. It achieved an F1-score of 84.6% on the linear text segmentation problem and 72.8% on the hierarchical text segmentation problem. It outperformed a carefully designed rule-based solution, particularly in distinguishing titles from subtitles.
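To illustrate how an NER-style formulation can yield a segmentation, the sketch below groups tagged tokens into sections. It is a minimal illustration under stated assumptions, not the TocBERT implementation: the tag names `TITLE`, `SUBTITLE`, and `O`, the `Section` class, and the `segment` function are hypothetical, and the sample tokens are invented rather than taken from MIMIC-III.

```python
from dataclasses import dataclass, field

# Hypothetical tag set; the abstract does not specify the labels used.
TITLE, SUBTITLE, OTHER = "TITLE", "SUBTITLE", "O"


@dataclass
class Section:
    heading: str
    level: int  # 0 = text before any heading, 1 = title, 2 = subtitle
    body: list = field(default_factory=list)


def segment(tokens, tags):
    """Group tagged tokens into sections.

    A run of consecutive tokens with the same heading tag forms one heading
    and opens a new section; untagged tokens join the current section's body.
    """
    sections = []
    prev_tag = OTHER
    for token, tag in zip(tokens, tags):
        if tag in (TITLE, SUBTITLE):
            if tag == prev_tag:
                # Continue the heading started by the previous token.
                sections[-1].heading += " " + token
            else:
                sections.append(Section(token, 1 if tag == TITLE else 2))
        else:
            if not sections:
                # Text that appears before the first detected heading.
                sections.append(Section("", 0))
            sections[-1].body.append(token)
        prev_tag = tag
    return sections


# Invented example tokens and tags, for illustration only.
tokens = ["HOSPITAL", "COURSE", ":", "Patient", "stable", ".",
          "Cardiac", ":", "No", "events", "."]
tags = [TITLE, TITLE, TITLE, OTHER, OTHER, OTHER,
        SUBTITLE, SUBTITLE, OTHER, OTHER, OTHER]

for section in segment(tokens, tags):
    print(section.level, section.heading, "|", " ".join(section.body))
```

In this sketch, a linear segmentation corresponds to ignoring `level` and treating every heading as a boundary, while a hierarchical segmentation additionally uses `level` to nest subtitles under the preceding title.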