The Segment Anything Model (SAM), a powerful vision foundation model pre-trained on a large-scale dataset, breaks the boundaries of general segmentation and has sparked various downstream applications. This paper introduces Hi-SAM, a unified model that leverages SAM for hierarchical text segmentation. Hi-SAM segments text at four hierarchies (stroke, word, text-line, and paragraph) and performs layout analysis as well. Specifically, we first turn SAM into a high-quality text stroke segmentation (TSS) model through a parameter-efficient fine-tuning approach. We then use this TSS model to iteratively generate text stroke labels in a semi-automatic manner, unifying the labels across the four text hierarchies in the HierText dataset. With these complete labels, we build the end-to-end trainable Hi-SAM on the TSS architecture with a customized hierarchical mask decoder. During inference, Hi-SAM offers both an automatic mask generation (AMG) mode and a promptable segmentation mode. In AMG mode, Hi-SAM first segments the text stroke foreground mask, then samples foreground points to generate hierarchical text masks, performing layout analysis along the way. In promptable mode, Hi-SAM provides word, text-line, and paragraph masks from a single point click. Experimental results show that our TSS model achieves state-of-the-art text stroke segmentation performance: 84.86% fgIOU on Total-Text and 88.96% fgIOU on TextSeg. Moreover, compared to the previous specialist for joint hierarchical detection and layout analysis on HierText, Hi-SAM achieves significant improvements: 4.73% PQ and 5.39% F1 on the text-line level, and 5.49% PQ and 7.39% F1 on the paragraph level layout analysis, while requiring 20x fewer training epochs. The code is available at https://github.com/ymy-k/Hi-SAM.
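To make the AMG-mode flow and the fgIOU metric more concrete, the following is a minimal sketch and not the Hi-SAM implementation. It assumes that a stroke mask is a grid of 0/1 values, that foreground points are drawn uniformly at random (the abstract does not state the actual sampling strategy), and that fgIOU is the intersection-over-union of the foreground class computed for a single image. The names `sample_foreground_points` and `fg_iou` are hypothetical and chosen only for illustration.

```python
import random
from typing import List, Sequence, Tuple

# A mask is a grid of rows holding 0/1 values; 1 marks text-stroke foreground.
Mask = Sequence[Sequence[int]]


def sample_foreground_points(
    mask: Mask, num_points: int, seed: int = 0
) -> List[Tuple[int, int]]:
    """Draw up to num_points distinct (x, y) coordinates from foreground pixels.

    Assumption: uniform random sampling. In AMG mode, points like these would
    serve as prompts for word, text-line, and paragraph mask generation.
    """
    foreground = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value
    ]
    rng = random.Random(seed)
    k = min(num_points, len(foreground))
    return rng.sample(foreground, k)


def fg_iou(pred: Mask, gt: Mask) -> float:
    """Foreground IoU between a predicted and a ground-truth mask of equal size.

    Assumption: per-image computation; two empty masks are scored as 1.0.
    """
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                intersection += 1
            if p or g:
                union += 1
    return intersection / union if union else 1.0


if __name__ == "__main__":
    predicted = [[0, 1, 1], [0, 1, 0]]
    ground_truth = [[0, 1, 0], [0, 1, 1]]
    print(fg_iou(predicted, ground_truth))
    print(sample_foreground_points(predicted, num_points=2))
```

In a real pipeline the masks would typically be array-backed, and a benchmark score may be accumulated over a whole dataset rather than averaged per image; nested lists are used here only to keep the sketch dependency-free.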