SPRITETOMESH: Automatic Mesh Generation for 2D Skeletal Animation Using Learned Segmentation and Contour-Aware Vertex Placement

We present SPRITETOMESH, a fully automatic pipeline for converting 2D game sprite images into triangle meshes compatible with skeletal animation frameworks such as Spine2D. Creating animation-ready meshes is traditionally a tedious manual process in which artists carefully place vertices along visual boundaries, a task that typically takes 15 to 60 minutes per sprite. Our method addresses this with a hybrid learned-algorithmic approach. A segmentation network (EfficientNet-B0 encoder with U-Net decoder) trained on over 100,000 sprite-mask pairs from 172 games achieves an intersection-over-union (IoU) of 0.87, producing accurate binary masks from arbitrary input images. From these masks, we extract exterior contour vertices using Douglas-Peucker simplification with adaptive arc subdivision, and we place interior vertices along visual boundaries detected via bilateral-filtered multi-channel Canny edge detection with contour-following placement. Delaunay triangulation with mask-based centroid filtering produces the final mesh. Through controlled experiments, we demonstrate that direct vertex position prediction via neural network heatmap regression is fundamentally not viable for this task: the heatmap decoder consistently fails to converge (loss plateau at 0.061), while the segmentation decoder trains normally under identical conditions. We attribute this to the inherently artistic nature of vertex placement: the same sprite can be meshed validly in many different ways, so there is no single ground-truth vertex set to regress toward. This negative result validates our hybrid design: learned segmentation where ground truth is unambiguous, algorithmic placement where domain heuristics are appropriate. The complete pipeline processes a sprite in under 3 seconds, a speedup of 300x to 1200x over manual creation. We release our trained model to the game development community.
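
To make two of the algorithmic steps above concrete, the following is a minimal pure-Python sketch of Douglas-Peucker polyline simplification and of mask-based centroid filtering of triangles. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `epsilon` tolerance, the (x, y) vertex convention, and the row-major `mask[row][col]` layout are all hypothetical, and the adaptive arc subdivision, edge detection, and Delaunay triangulation steps are not shown (triangles are assumed to be supplied as index triples by some triangulation routine).

```python
from math import hypot


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        # Degenerate segment: a and b coincide.
        return hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, epsilon):
    """Simplify an open polyline with the Douglas-Peucker algorithm.

    A point is kept only if it lies farther than epsilon from the chord
    joining the endpoints of the sub-polyline currently being examined.
    """
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > max_dist:
            index, max_dist = i, d
    if max_dist <= epsilon:
        return [first, last]
    left = douglas_peucker(points[: index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    # The split point ends `left` and starts `right`; keep it once.
    return left[:-1] + right


def filter_triangles_by_centroid(vertices, triangles, mask):
    """Keep triangles whose centroid falls on a foreground mask pixel.

    Assumptions (hypothetical conventions for this sketch): vertices are
    (x, y) pairs in pixel units, triangles are index triples into
    `vertices`, and mask is a row-major grid indexed as mask[row][col].
    """
    height, width = len(mask), len(mask[0])
    kept = []
    for i, j, k in triangles:
        cx = (vertices[i][0] + vertices[j][0] + vertices[k][0]) / 3.0
        cy = (vertices[i][1] + vertices[j][1] + vertices[k][1]) / 3.0
        col, row = int(cx), int(cy)
        if 0 <= row < height and 0 <= col < width and mask[row][col]:
            kept.append((i, j, k))
    return kept
```

The centroid test is what lets a triangulation of a non-convex silhouette be trimmed back to the sprite: a Delaunay triangulation covers the convex hull of its input points, so triangles that bridge a concavity have centroids on background pixels and are discarded.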
