Recent vision-language model (VLM)-based approaches have achieved impressive results on image vectorization tasks. However, they are typically evaluated on synthetic benchmarks, where clean SVGs are rasterized at high resolution and then re-vectorized. As a result, these methods generalize poorly to real-world inputs, such as images rasterized by unknown methods or generated by text-to-image models. We introduce VectorArk, a new VLM-based model designed for robust and practical image vectorization. VectorArk employs a novel rounded polygon representation that simplifies learning while naturally producing smooth, visually appealing primitives. We also propose a degradation model that improves robustness to diverse and imperfect inputs. Our experiments show that VectorArk achieves better geometric completeness and artifact suppression than previous methods across multiple datasets, and comprehensive ablations validate the contribution of each component.