tieval: An Evaluation Framework for Temporal Information Extraction Systems

Temporal information extraction (TIE) has attracted a great deal of interest over the last two decades, leading to the development of a significant number of datasets. Despite its benefits, this abundance of corpora makes benchmarking TIE systems difficult. On the one hand, different datasets follow different annotation schemes, which hinders the comparison of competing systems across corpora. On the other hand, each corpus is commonly distributed in a different format, so a researcher or practitioner must invest considerable engineering effort to develop parsers for all of them. This constraint forces researchers to evaluate their systems on a limited number of datasets, which in turn limits the comparability of those systems. Yet another obstacle to the comparability of TIE systems is the evaluation metric employed. While most works adopt traditional metrics such as precision, recall, and $F_1$, a few prefer temporal awareness, a metric tailored to evaluate temporal systems more comprehensively. Although it is not clear why temporal awareness is absent from the evaluation of most systems, one factor that certainly weighs on this decision is the need to implement the temporal closure algorithm in order to compute it, an algorithm that is neither straightforward to implement nor readily available. All in all, these problems have limited the fair comparison between approaches and, consequently, the development of temporal extraction systems. To mitigate these problems, we have developed tieval, a Python library that provides a concise interface for importing different corpora and facilitates system evaluation. In this paper, we present the first public release of tieval and highlight its most relevant features.
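To make the role of temporal closure concrete, the following is a minimal sketch of a temporal-awareness-style score. It is not tieval's API: the function names (`normalize`, `temporal_closure`, `temporal_awareness`) and the triple format are hypothetical, only BEFORE/AFTER relations are supported, and the annotated relation sets are used directly where the full metric would use reduced graphs. Precision is taken as the share of system relations found in the closure of the reference, and recall as the share of reference relations found in the closure of the system output.

```python
from itertools import product


def normalize(relations):
    """Map (source, relation, target) triples to a set of (earlier, later) pairs.

    Assumption: only BEFORE and AFTER are handled in this sketch.
    """
    pairs = set()
    for source, relation, target in relations:
        if relation == "BEFORE":
            pairs.add((source, target))
        elif relation == "AFTER":
            pairs.add((target, source))
        else:
            raise ValueError(f"unsupported relation: {relation}")
    return pairs


def temporal_closure(pairs):
    """Add every relation implied by transitivity: a < b and b < c give a < c."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(tuple(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def temporal_awareness(reference, system):
    """Return (precision, recall, f1) computed against the closed graphs."""
    ref, sys_ = normalize(reference), normalize(system)
    ref_closure, sys_closure = temporal_closure(ref), temporal_closure(sys_)
    precision = len(sys_ & ref_closure) / len(sys_) if sys_ else 0.0
    recall = len(ref & sys_closure) / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# The system relation is never annotated explicitly, but the reference implies it.
reference = [("e1", "BEFORE", "e2"), ("e2", "BEFORE", "e3")]
system = [("e1", "BEFORE", "e3")]
precision, recall, f1 = temporal_awareness(reference, system)
```

In this example the system's single relation is credited as correct because it follows from the reference by transitivity, whereas a plain set comparison against the annotated relations would count it as an error. Recall stays at zero because neither annotated reference relation can be derived from the system output.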
