Stringalign: Moving beyond summary statistics with a transparent Unicode-aware tool for evaluating automatic transcription models

Comparing text strings is crucial when evaluating and understanding the performance of text processing tasks such as document recognition and audio transcription. With an increasingly complex landscape of AI-based handwritten text recognition (HTR), optical character recognition (OCR) and automatic speech recognition (ASR) models, there is a need for tools that make evaluation flexible and reproducible. This paper presents Stringalign, a Python library designed to simplify the evaluation of automatic transcription projects and make it transparent. Stringalign's tools for examining and visualising both the rate and the types of errors a model makes give insight into possible improvements and help inform model selection for a particular task. Widely used string comparison metrics, such as the character and word error rates (CER and WER), are useful but can be ambiguous because definitions of what constitutes a character and a word vary. Stringalign addresses this challenge by ensuring all preprocessing (i.e. normalisation and tokenisation) is transparent and easily replicable, and by providing tools to move beyond summary statistics and analyse common model errors. Moreover, Stringalign adheres to the FAIR (Findable, Accessible, Interoperable, and Reusable) principles for research software while staying lightweight and easy to integrate into researchers' existing workflows. In this paper, we discuss challenges with character- and word-level string comparisons and show through examples that, where existing tools can yield opaque and sometimes confusing results, Stringalign provides an easy-to-use and unambiguous alternative.
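
The ambiguity in CER and WER can be illustrated with a minimal sketch that uses only the Python standard library. It does not use or reproduce Stringalign's API. The helper names (`levenshtein`, `cer`, `wer`) and the two tokenisers are hypothetical and chosen for illustration only. The sketch assumes that a "character" is a Unicode code point and that the error rate is the edit distance divided by the reference length.

```python
import re
import unicodedata


def levenshtein(reference, prediction):
    """Edit distance (insert, delete, substitute) between two token sequences."""
    previous = list(range(len(prediction) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, pred_token in enumerate(prediction, start=1):
            cost = 0 if ref_token == pred_token else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def cer(reference, prediction, form=None):
    """Code-point-level error rate, optionally after Unicode normalisation."""
    if form is not None:
        reference = unicodedata.normalize(form, reference)
        prediction = unicodedata.normalize(form, prediction)
    return levenshtein(reference, prediction) / len(reference)


def wer(reference, prediction, tokenise):
    """Word-level error rate for a caller-supplied tokeniser (an assumption)."""
    ref_tokens = tokenise(reference)
    return levenshtein(ref_tokens, tokenise(prediction)) / len(ref_tokens)


# The same visible text, encoded in two different ways.
reference = "caf\u00e9"    # "é" as one precomposed code point
prediction = "cafe\u0301"  # "e" followed by a combining acute accent

cer_raw = cer(reference, prediction)         # compares raw code points
cer_nfc = cer(reference, prediction, "NFC")  # compares after NFC normalisation

# Two plausible definitions of a "word".
split_on_whitespace = str.split


def split_on_word_characters(text):
    return re.findall(r"\w+", text)


wer_whitespace = wer("Hello, world", "Hello world", split_on_whitespace)
wer_word_chars = wer("Hello, world", "Hello world", split_on_word_characters)

print(cer_raw, cer_nfc)                # 0.5 0.0
print(wer_whitespace, wer_word_chars)  # 0.5 0.0
```

In both cases the strings are identical and only the preprocessing changes, yet the reported error rate moves from 0.5 to 0.0. A CER or WER figure is therefore hard to interpret or reproduce unless the normalisation and tokenisation behind it are stated explicitly.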
