tieval: An Evaluation Framework for Temporal Information Extraction Systems

Temporal information extraction (TIE) has attracted a great deal of interest over the last two decades, leading to the development of a significant number of datasets. Despite its benefits, this abundance of corpora makes benchmarking TIE systems difficult. On the one hand, different datasets follow different annotation schemes, which hinders the comparison of competing systems across corpora. On the other hand, each corpus is commonly distributed in a different format, so a researcher or practitioner must invest considerable engineering effort to develop parsers for all of them. This constraint forces researchers to evaluate their systems on a limited number of datasets, which in turn limits the comparability of the systems. Yet another obstacle to the comparability of TIE systems is the evaluation metric employed. While most research works adopt traditional metrics such as precision, recall, and $F_1$, a few others prefer temporal awareness, a metric tailored to evaluate temporal systems more comprehensively. Although the reason temporal awareness is absent from the evaluation of most systems is not clear, one factor that certainly weighs on this decision is that computing it requires the temporal closure algorithm, which is neither straightforward to implement nor readily available. All in all, these problems have hampered fair comparison between approaches and, consequently, the development of temporal extraction systems. To mitigate these problems, we have developed tieval, a Python library that provides a concise interface for importing different corpora and facilitates system evaluation. In this paper, we present the first public release of tieval and highlight its most relevant features.
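
To make the dependence of temporal awareness on temporal closure concrete, the following is a minimal sketch in plain Python. It does not use the tieval API. It assumes a deliberately simplified setting in which every relation is a BEFORE link between two events and the annotations are acyclic, so that temporal closure reduces to transitive closure. The function names (closure, reduction, temporal_awareness) and the event identifiers are hypothetical. Precision is taken as the share of reduced system relations found in the closure of the reference, and recall as the share of reduced reference relations found in the closure of the system, following the usual definition of the metric.

```python
from itertools import product


def closure(edges):
    """Transitive closure of (a, b) pairs, each read as 'a BEFORE b'."""
    closed = set(edges)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(list(closed), repeat=2):
            if b == c and (a, d) not in closed:
                closed.add((a, d))
                changed = True
    return closed


def reduction(edges):
    """Transitive reduction: keep only pairs not implied by two others.

    Assumes the relations are acyclic (a strict partial order).
    """
    closed = closure(edges)
    nodes = {node for edge in closed for node in edge}
    return {
        (a, b)
        for (a, b) in closed
        if not any((a, m) in closed and (m, b) in closed for m in nodes)
    }


def temporal_awareness(system, reference):
    """Return (precision, recall, f1) under the simplified BEFORE-only setting."""
    sys_closed, ref_closed = closure(system), closure(reference)
    sys_reduced, ref_reduced = reduction(system), reduction(reference)
    precision = len(sys_reduced & ref_closed) / len(sys_reduced) if sys_reduced else 0.0
    recall = len(ref_reduced & sys_closed) / len(ref_reduced) if ref_reduced else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical annotations: the reference says e1 < e2 < e3,
# the system says e1 < e2 and e1 < e3.
reference = {("e1", "e2"), ("e2", "e3")}
system = {("e1", "e2"), ("e1", "e3")}
precision, recall, f1 = temporal_awareness(system, reference)
```

In this example the system relation (e1, e3) is absent from the reference annotation but follows from it by closure, so temporal awareness counts it as correct, whereas exact-match precision would not. A full implementation must also handle the remaining relation types and inconsistent annotations, which is where most of the implementation effort lies.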
