Sketch semantic segmentation is a well-explored and pivotal problem in computer vision that involves assigning pre-defined part labels to individual strokes. This paper presents ContextSeg, a simple yet highly effective two-stage approach to this problem. In the first stage, to better encode the shape and positional information of strokes, we propose predicting an extra dense distance field in an autoencoder network to reinforce structural information learning. In the second stage, we treat an entire stroke as a single entity and label a group of strokes within the same semantic part using an auto-regressive Transformer with the default attention mechanism. With group-based labeling, our method can fully leverage context information when making decisions for the remaining groups of strokes. Extensive evaluation on two representative datasets shows that our method achieves the best segmentation accuracy among state-of-the-art approaches. Additionally, we offer insights into addressing part imbalance in training data and report a preliminary experiment on cross-category training, which may inspire future research in this field.
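To make the first-stage idea concrete, the following is a minimal sketch of what a dense distance field for a single stroke could look like. The abstract does not define the field precisely, so this assumes an unsigned Euclidean distance from each pixel to the nearest point on the stroke polyline. The function names `point_segment_distance` and `stroke_distance_field` are hypothetical and are not taken from the ContextSeg implementation.

```python
import math


def point_segment_distance(px, py, ax, ay, bx, by):
    """Euclidean distance from point (px, py) to the segment from (ax, ay) to (bx, by)."""
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: both endpoints coincide.
        return math.hypot(px - ax, py - ay)
    # Project the point onto the segment's line and clamp to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def stroke_distance_field(stroke, width, height):
    """Return a height x width grid; entry [y][x] is the distance from pixel (x, y)
    to the nearest point on the stroke polyline (assumed definition, see lead-in)."""
    if not stroke:
        raise ValueError("stroke must contain at least one point")
    if len(stroke) == 1:
        segments = [(stroke[0], stroke[0])]
    else:
        segments = list(zip(stroke, stroke[1:]))
    field = []
    for y in range(height):
        row = []
        for x in range(width):
            row.append(min(
                point_segment_distance(x, y, ax, ay, bx, by)
                for (ax, ay), (bx, by) in segments
            ))
        field.append(row)
    return field


# A short horizontal stroke on a 5 x 3 canvas.
field = stroke_distance_field([(1, 1), (3, 1)], width=5, height=3)
```

Unlike a binary stroke raster, such a field is non-zero almost everywhere, so a reconstruction target of this kind gives the autoencoder a signal about where the stroke lies relative to every pixel. Whether the paper normalizes, truncates, or otherwise transforms the field is not stated in the abstract.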