Typhoon OCR: Open Vision-Language Model For Thai Document Extraction

Document extraction is a core component of digital workflows, yet existing vision-language models (VLMs) predominantly favor high-resource languages. Thai poses additional challenges: a complex non-Latin script, the absence of explicit word boundaries, and the prevalence of highly unstructured real-world documents, all of which limit the effectiveness of current open-source models. This paper presents Typhoon OCR, an open VLM for document extraction tailored to Thai and English. The model is fine-tuned from vision-language backbones on a Thai-focused training dataset, which is built with a multi-stage data construction pipeline that combines traditional OCR, VLM-based restructuring, and curated synthetic data. Typhoon OCR is a unified framework that performs text transcription and layout reconstruction while maintaining document-level structural consistency. The latest iteration of our model, Typhoon OCR V1.5, is a compact and inference-efficient model designed to reduce reliance on metadata and simplify deployment. Comprehensive evaluations across diverse Thai document categories, including financial reports, government forms, books, infographics, and handwritten documents, show that Typhoon OCR achieves performance comparable to or exceeding that of larger frontier proprietary models at substantially lower computational cost. The results demonstrate that open vision-language OCR models can achieve accurate text extraction and layout reconstruction for Thai documents while remaining lightweight and deployable.
