There has been recent interest in improving optical character recognition (OCR) for endangered languages, particularly because a large number of documents and books in these languages are not in machine-readable formats. The performance of OCR systems is typically evaluated using automatic metrics such as character and word error rates. While error rates are useful for comparing different models and systems, they do not measure whether and how the transcriptions produced by OCR tools are useful to downstream users. In this paper, we present a human-centric evaluation of OCR systems, focusing on the Kwak'wala language as a case study. Through a user study, we show that using OCR reduces the time spent on manual transcription of culturally valuable documents (a task often undertaken by endangered language community members and researchers) by over 50%. Our results demonstrate the potential benefits that OCR tools can have for downstream language documentation and revitalization efforts.