Temporal expression identification is crucial for understanding texts written in natural language. Although highly effective systems such as HeidelTime exist, their limited runtime performance hampers adoption in large-scale applications and production environments. In this paper, we introduce the TEI2GO models, which match HeidelTime's effectiveness with significantly improved runtime, support six languages, and achieve state-of-the-art results in four of them. To train the TEI2GO models, we combined manually annotated reference corpora with ``Professor HeidelTime'', a comprehensive weakly labeled corpus of news texts that we developed by annotating them with HeidelTime. This corpus comprises a total of $138,069$ documents (across six languages) with $1,050,921$ temporal expressions, making it the largest open-source annotated dataset for temporal expression identification to date. By describing how the models were produced, we aim to encourage the research community to further explore, refine, and extend the set of models to additional languages and domains. Code, annotations, and models are openly available for community exploration and use, and the models are hosted on HuggingFace for straightforward integration.
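To make the idea of weak labeling concrete, the following is a minimal sketch of how a rule-based tagger's output can be turned into token-level training labels. It rests on several assumptions: the regular expression is only a toy stand-in for a rule-based system such as HeidelTime (it does not reproduce HeidelTime's rules), and the function names, the `TIMEX` tag name, and the whitespace tokenization are hypothetical choices made for illustration, not the pipeline used to build the corpus.

```python
import re

# Hypothetical stand-in for a rule-based tagger such as HeidelTime:
# a toy regex covering a few English date patterns (NOT HeidelTime's rules).
MONTHS = (
    r"(?:January|February|March|April|May|June|July|"
    r"August|September|October|November|December)"
)
TIMEX_PATTERN = re.compile(
    rf"\b(?:{MONTHS}\s+\d{{1,2}},\s+\d{{4}}"   # e.g. "March 3, 2021"
    rf"|\d{{4}}-\d{{2}}-\d{{2}}"               # e.g. "2021-03-03"
    rf"|yesterday|today|tomorrow)\b",
    re.IGNORECASE,
)


def weak_label(text):
    """Return (start, end, surface) character spans found by the toy tagger."""
    return [(m.start(), m.end(), m.group()) for m in TIMEX_PATTERN.finditer(text)]


def to_bio(text, spans):
    """Project character spans onto whitespace tokens as BIO tags."""
    tagged = []
    for tok in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, _ in spans:
            # A token belongs to a span if the two character ranges overlap.
            if tok.start() < end and tok.end() > start:
                tag = "B-TIMEX" if tok.start() <= start else "I-TIMEX"
                break
        tagged.append((tok.group(), tag))
    return tagged


doc = "The summit opened on March 3, 2021 and ended yesterday."
spans = weak_label(doc)   # weak (automatically produced) annotations
bio = to_bio(doc, spans)  # token-level labels a sequence tagger could train on

for token, tag in bio:
    print(f"{token}\t{tag}")
```

In such a setup, the automatically produced labels play the role of the weakly labeled corpus, while manually annotated corpora supply gold labels; a faster model trained on both can then stand in for the slower rule-based tagger at inference time.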