Evaluation plays a critical role in deep learning as a fundamental building block of any prediction-based system. However, the large number of Natural Language Processing (NLP) tasks and the proliferation of metrics make it difficult to evaluate different systems consistently. To address these challenges, we introduce jury, a toolkit that provides a unified evaluation framework with standardized structures for performing evaluation across different tasks and metrics. The objective of jury is to standardize and improve metric evaluation for all systems and to help the community overcome the challenges in evaluation. Since its open-source release, jury has reached a wide audience; it is available at https://github.com/obss/jury.