We present a font classification system capable of identifying 394 font families from rendered text images. Our approach fine-tunes a DINOv2 Vision Transformer using Low-Rank Adaptation (LoRA), achieving approximately 86% top-1 accuracy while training fewer than 1% of the model's 87.2M parameters. We introduce a synthetic dataset generation pipeline that renders Google Fonts at scale with diverse augmentations including randomized colors, alignment, line wrapping, and Gaussian noise, producing training images that generalize to real-world typographic samples. The model incorporates built-in preprocessing to ensure consistency between training and inference, and is deployed as a HuggingFace Inference Endpoint. We release the model, dataset, and full training pipeline as open-source resources.
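As a rough illustration of the augmentations listed above (randomized colors, alignment, line wrapping, and Gaussian noise), the following minimal sketch samples one rendering configuration per training image and perturbs a grayscale pixel grid. It is not the released pipeline: the names `RenderConfig`, `sample_render_config`, and `add_gaussian_noise`, as well as the wrap-width and noise ranges, are hypothetical choices made for this example, and the actual font rendering step is omitted.

```python
import random
import textwrap
from dataclasses import dataclass


@dataclass
class RenderConfig:
    """One randomly sampled set of rendering parameters (hypothetical structure)."""
    text_color: tuple        # (R, G, B), each in 0..255
    background_color: tuple  # (R, G, B), each in 0..255
    alignment: str           # "left", "center", or "right"
    lines: list              # the text after line wrapping
    noise_sigma: float       # standard deviation of the Gaussian pixel noise


def random_color(rng):
    """Draw a uniformly random RGB color."""
    return tuple(rng.randrange(256) for _ in range(3))


def sample_render_config(text, rng):
    """Sample colors, alignment, wrapping, and noise level for one image.

    The ranges below are assumptions for illustration, not values from the paper.
    """
    wrap_width = rng.randint(12, 40)  # maximum characters per line
    return RenderConfig(
        text_color=random_color(rng),
        background_color=random_color(rng),
        alignment=rng.choice(["left", "center", "right"]),
        lines=textwrap.wrap(text, width=wrap_width),
        noise_sigma=rng.uniform(0.0, 12.0),
    )


def add_gaussian_noise(pixels, sigma, rng):
    """Add zero-mean Gaussian noise to rows of 0..255 values, clamping the result."""
    return [
        [min(255, max(0, round(value + rng.gauss(0.0, sigma)))) for value in row]
        for row in pixels
    ]


rng = random.Random(0)  # fixed seed so the sampled configuration is reproducible
config = sample_render_config("The quick brown fox jumps over the lazy dog", rng)
noisy = add_gaussian_noise([[0, 128, 255], [255, 128, 0]], config.noise_sigma, rng)
```

In a full pipeline, each sampled configuration would be passed to a text renderer together with a font file, and the noise would be applied to the rendered image rather than to a toy grid.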