Open-vocabulary scene sketch semantic segmentation aims to assign dense semantic labels to sparse line drawings based on flexible category vocabularies specified at inference time, without relying on pixel-level annotations during training. Unlike natural images, sketches lack texture and color cues, so semantic understanding depends heavily on stroke layout and spatial configuration. As a result, vision-language features taken from a single layer are inherently unstable. Our key observation is that attention maps from different Vision Transformer layers encode complementary spatial cues: shallow layers capture global structural layouts, while deeper layers focus on local stroke intersections and object parts. This suggests that cross-layer aggregation provides a more robust structural prior than any individual layer alone. Leveraging this insight, we propose a structure-aware framework built upon \textbf{L}ayer-wise \textbf{A}ccumulated \textbf{S}tructural \textbf{A}ttention (\textbf{LASA}), which aggregates multi-layer attention to guide hierarchical semantic alignment under weak supervision and to refine predictions during inference. Experiments on FS-COCO, SFSD, and FrISS show that LASA improves mIoU by $+3.43$, $+8.01$, and $+15.74$, respectively, over prior weakly supervised baselines, demonstrating consistent gains in both segmentation accuracy and spatial coherence. Our source code will be made publicly available.
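To make the idea of cross-layer attention aggregation concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the aggregation rule, so uniform averaging over heads and layers is an assumption made here purely for illustration. The function names `accumulate_attention` and `refine_scores` are hypothetical, and plain Python lists are used to keep the example dependency-free.

```python
from typing import List

Matrix = List[List[float]]


def accumulate_attention(layer_attn: List[List[Matrix]]) -> Matrix:
    """Average attention maps over all heads and layers.

    layer_attn[l][h] is a (tokens x tokens) attention matrix whose rows
    sum to 1. Uniform averaging is an ASSUMPTION for illustration; the
    source does not state how layers are weighted.
    """
    n = len(layer_attn[0][0])
    acc = [[0.0] * n for _ in range(n)]
    total = 0
    for heads in layer_attn:
        for attn in heads:
            for i in range(n):
                for j in range(n):
                    acc[i][j] += attn[i][j]
            total += 1
    # Dividing by the number of accumulated maps keeps every row summing to 1.
    return [[value / total for value in row] for row in acc]


def refine_scores(scores: Matrix, structural_attn: Matrix) -> Matrix:
    """Smooth per-token class scores with the aggregated attention.

    scores is (tokens x classes). Each token's refined score is the
    attention-weighted mix of all tokens' scores (hypothetical refinement
    step, assumed here as one plausible use of the structural prior).
    """
    n_classes = len(scores[0])
    return [
        [sum(w * scores[j][c] for j, w in enumerate(row)) for c in range(n_classes)]
        for row in structural_attn
    ]
```

Under these assumptions, the aggregated map stays row-stochastic, so the refinement step is a convex combination of token scores: tokens that attend to each other across layers are pulled toward the same label.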