The Segment Anything Model (SAM), a vision foundation model pretrained on a large-scale dataset, breaks the boundaries of general segmentation and sparks various downstream applications. This paper introduces Hi-SAM, a unified model that leverages SAM for hierarchical text segmentation. Hi-SAM excels at segmentation across four hierarchies, including pixel-level text, word, text-line, and paragraph, while also performing layout analysis. Specifically, we first turn SAM into a high-quality pixel-level text segmentation (TS) model through parameter-efficient fine-tuning. We use this TS model to iteratively generate pixel-level text labels in a semi-automatic manner, unifying labels across the four text hierarchies in the HierText dataset. Subsequently, with these complete labels, we train the end-to-end Hi-SAM, which builds on the TS architecture with a customized hierarchical mask decoder. During inference, Hi-SAM offers both an automatic mask generation (AMG) mode and a promptable segmentation (PS) mode. In the AMG mode, Hi-SAM first segments pixel-level text foreground masks, then samples foreground points for hierarchical text mask generation, achieving layout analysis in passing. In the PS mode, Hi-SAM provides word, text-line, and paragraph masks from a single point click. Experimental results show the state-of-the-art performance of our TS model: 84.86% fgIOU on Total-Text and 88.96% fgIOU on TextSeg for pixel-level text segmentation. Moreover, compared to the previous specialist model for joint hierarchical detection and layout analysis on HierText, Hi-SAM achieves significant improvements: 4.73% PQ and 5.39% F1 on text-line level layout analysis, and 5.49% PQ and 7.39% F1 on paragraph level layout analysis, while requiring $20\times$ fewer training epochs. The code is available at https://github.com/ymy-k/Hi-SAM.