This paper presents Callico, a web-based, open-source platform designed to simplify annotation in document recognition projects. The move towards data-centric AI in machine learning and deep learning underscores the importance of high-quality data and the need for specialised tools that make generating such data more efficient and effective. For document image annotation, Callico offers a dual display for digitised documents, so that scanned images and text can be viewed and annotated side by side. This capability is critical for OCR and HTR model training, document layout analysis, named entity recognition, form-based key-value annotation, and hierarchical structure annotation with element grouping. The platform supports collaborative annotation and is backed by a commitment to open-source development, high code-quality standards and easy deployment via Docker. Three illustrative use cases demonstrate Callico's applicability and utility: the transcription of the Belfort municipal registers, the indexing of French World War II prisoners for the ICRC, and the extraction of personal information from the Socface project's census lists.