Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation

The Segment Anything Model (SAM), a powerful vision foundation model pre-trained on a large-scale dataset, breaks the boundaries of general segmentation and has sparked various downstream applications. This paper introduces Hi-SAM, a unified model that leverages SAM for hierarchical text segmentation. Hi-SAM segments text at four hierarchies (stroke, word, text-line, and paragraph) and performs layout analysis as well. Specifically, we first turn SAM into a high-quality text stroke segmentation (TSS) model through a parameter-efficient fine-tuning approach. We then use this TSS model to iteratively generate text stroke labels in a semi-automatic manner, unifying the labels across the four text hierarchies in the HierText dataset. With these complete labels, we build the end-to-end trainable Hi-SAM on top of the TSS architecture, adding a customized hierarchical mask decoder. During inference, Hi-SAM offers both an automatic mask generation (AMG) mode and a promptable segmentation mode. In the AMG mode, Hi-SAM first segments the text stroke foreground mask, then samples foreground points to generate hierarchical text masks, obtaining layout analysis as a by-product. In the promptable mode, Hi-SAM provides word, text-line, and paragraph masks from a single point click. Experimental results show state-of-the-art performance for our TSS model: 84.86% fgIOU on Total-Text and 88.96% fgIOU on TextSeg for text stroke segmentation. Moreover, compared to the previous specialist for joint hierarchical detection and layout analysis on HierText, Hi-SAM achieves significant improvements: 4.73% PQ and 5.39% F1 on the text-line level, and 5.49% PQ and 7.39% F1 on the paragraph-level layout analysis, while requiring 20× fewer training epochs. The code is available at https://github.com/ymy-k/Hi-SAM.
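To make two notions from the abstract concrete, the following is a minimal sketch of a foreground IoU score between two binary masks and of sampling foreground points from a stroke mask to use as point prompts. It is not taken from the Hi-SAM code base: the function names `fg_iou` and `sample_foreground_points`, the nested-list mask representation, the uniform random sampling, and the empty-mask convention are all assumptions made for illustration. How the reported fgIOU is aggregated over a dataset is not stated in the abstract.

```python
import random


def fg_iou(pred, gt):
    """Foreground IoU of two binary masks given as nested lists of 0/1.

    Hypothetical helper: intersection over union of the foreground pixels.
    """
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def sample_foreground_points(mask, num_points, seed=0):
    """Pick up to num_points (x, y) coordinates lying on foreground pixels.

    Hypothetical helper: uniform sampling is an assumption, not the
    strategy documented for Hi-SAM.
    """
    coords = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value
    ]
    rng = random.Random(seed)
    return rng.sample(coords, min(num_points, len(coords)))


# Toy 3x4 masks: 3 shared foreground pixels, 5 pixels in the union.
pred = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
gt = [[0, 1, 1, 1],
      [0, 1, 0, 0],
      [0, 0, 0, 0]]

print(fg_iou(pred, gt))                    # 0.6
print(sample_foreground_points(pred, 2))   # two (x, y) points on the strokes
```

In an AMG-style pipeline, each sampled point would be passed to the promptable decoder as a single-click prompt; that step depends on the actual model and is left out of the sketch.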
