The adoption of tablets with touchscreens and styluses is increasing, and a key feature is converting handwriting to text, enabling search, indexing, and AI assistance. Meanwhile, vision-language models (VLMs) are now the go-to solution for image understanding, thanks to both their state-of-the-art performance across a variety of tasks and the simplicity of a unified approach to training, fine-tuning, and inference. While VLMs obtain high performance on image-based tasks, they perform poorly on handwriting recognition when applied naively, i.e., by rendering handwriting as an image and performing optical character recognition (OCR). In this paper, we study online handwriting recognition with VLMs, going beyond naive OCR. We propose a novel tokenized representation of digital ink (online handwriting) that encodes the ink both as a time-ordered sequence of strokes written out as text and as an image. We show that this representation yields results comparable to or better than state-of-the-art online handwriting recognizers. We demonstrate wide applicability with results for two different VLM families on multiple public datasets. Our approach can be applied to off-the-shelf VLMs, does not require any changes to their architecture, and can be used in both fine-tuning and parameter-efficient tuning. We perform a detailed ablation study to identify the key elements of the proposed representation.
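To make the idea of a dual representation concrete, the following is a minimal sketch of how digital ink could be turned into both a text sequence and an image. It is an illustration under stated assumptions, not the tokenization proposed in the paper: the `<stroke>` separator, the `x,y` coordinate format, the 64-cell grid, and the function names `normalize`, `ink_to_text`, and `ink_to_raster` are all hypothetical choices made here.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Stroke = List[Point]  # points in the order they were drawn


def normalize(ink: List[Stroke], size: int = 64) -> List[List[Tuple[int, int]]]:
    """Scale all points onto an integer grid [0, size-1], keeping aspect ratio."""
    xs = [x for stroke in ink for x, _ in stroke]
    ys = [y for stroke in ink for _, y in stroke]
    min_x, min_y = min(xs), min(ys)
    # Use the larger extent so the ink is not stretched; avoid dividing by zero.
    span = max(max(xs) - min_x, max(ys) - min_y) or 1.0
    scale = (size - 1) / span
    return [
        [(round((x - min_x) * scale), round((y - min_y) * scale)) for x, y in stroke]
        for stroke in ink
    ]


def ink_to_text(ink: List[Stroke], size: int = 64) -> str:
    """Serialize strokes in drawing order; '<stroke>' is a hypothetical separator."""
    return " ".join(
        "<stroke> " + " ".join(f"{x},{y}" for x, y in stroke)
        for stroke in normalize(ink, size)
    )


def ink_to_raster(ink: List[Stroke], size: int = 64) -> List[List[int]]:
    """Render the same ink as a binary image (rows of 0/1 pixels)."""
    grid = [[0] * size for _ in range(size)]
    for stroke in normalize(ink, size):
        if len(stroke) == 1:  # a single dot
            x, y = stroke[0]
            grid[y][x] = 1
        for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
            # Sample each segment densely enough to leave no gaps.
            steps = max(abs(x1 - x0), abs(y1 - y0), 1)
            for i in range(steps + 1):
                x = round(x0 + (x1 - x0) * i / steps)
                y = round(y0 + (y1 - y0) * i / steps)
                grid[y][x] = 1
    return grid
```

In such a setup, the string from `ink_to_text` would be placed in the text prompt and the grid from `ink_to_raster` would be passed as the image input, so the model sees both the temporal order of the strokes and their overall shape.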