Infographics are widely used to communicate information through a combination of text, icons, and data visualizations, but once exported as images, their content is locked into pixels, making updates, localization, and reuse expensive. We describe \textsc{Images2Slides}, an API-based pipeline that converts a static infographic (PNG/JPG) into a native, editable Google Slides slide by extracting a region-level specification with a vision-language model (VLM), mapping pixel geometry into slide coordinates, and recreating elements with the Google Slides \texttt{batchUpdate} API. The system is model-agnostic, supporting multiple VLM backends through a common JSON region schema and deterministic postprocessing. On a controlled benchmark of 29 programmatically generated infographic slides with known ground-truth regions, \textsc{Images2Slides} achieves an overall element recovery rate of $0.989\pm0.057$ (text: $0.985\pm0.083$; images: $1.000\pm0.000$), with a mean text transcription error of $\mathrm{CER}=0.033\pm0.149$ and mean layout fidelity of $\mathrm{IoU}=0.364\pm0.161$ for text regions and $0.644\pm0.131$ for image regions. We also highlight practical engineering challenges in reconstruction, including text size calibration and non-uniform backgrounds, and describe failure modes that guide future work.
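To make the coordinate-mapping step concrete, here is a minimal Python sketch of how a pixel-space bounding box can be rescaled into the EMU (English Metric Unit) coordinates the Google Slides API expects; the function and constant names (\texttt{map\_region}, \texttt{SLIDE\_W\_EMU}) are illustrative assumptions, not the paper's actual implementation.

```python
# A default Google Slides page is 10 x 5.625 inches; 1 inch = 914,400 EMU.
EMU_PER_INCH = 914_400
SLIDE_W_EMU = 10 * EMU_PER_INCH           # 9,144,000 EMU
SLIDE_H_EMU = int(5.625 * EMU_PER_INCH)   # 5,143,500 EMU

def map_region(bbox_px, image_size_px):
    """Map a pixel bounding box (x, y, w, h) into slide EMU coordinates,
    preserving the region's relative position and size."""
    img_w, img_h = image_size_px
    sx = SLIDE_W_EMU / img_w   # horizontal pixels-to-EMU scale
    sy = SLIDE_H_EMU / img_h   # vertical pixels-to-EMU scale
    x, y, w, h = bbox_px
    return {
        "translateX": round(x * sx),
        "translateY": round(y * sy),
        "width": round(w * sx),
        "height": round(h * sy),
    }

# Example: a 400x100 px text region at (120, 60) in a 1280x720 infographic.
region = map_region((120, 60, 400, 100), (1280, 720))
print(region)
```

The resulting size and translation values are the kind that would populate the \texttt{elementProperties} of a \texttt{createShape} or \texttt{createImage} request inside a \texttt{batchUpdate} call.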