Comparative evaluation of several systems is a recurring task in research. It is a key step before deciding which system to use in our work or, once our research has been conducted, for demonstrating the potential of the resulting model. It is also the main task in the evaluation of competitive public challenges. Our proposed software, DEEP, automates both the execution and the scoring of machine translation and optical character recognition models, and it is easily extensible to other tasks. DEEP receives dockerized systems, runs them while extracting information at the same time, and assesses their hypotheses against references. With this approach, evaluators can achieve a better understanding of the performance of each model. Moreover, the software uses a clustering algorithm based on a statistical analysis of the significance of the results yielded by each model according to the evaluation metrics. As a result, evaluators can identify clusters of performance among the submitted proposals and better understand the significance of their differences. Additionally, we offer a visualization web app so that the results can be adequately understood and interpreted. Finally, we present an example use case of DEEP.
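To make the idea of significance-based clustering concrete, the following is a minimal sketch and not DEEP's actual implementation, which is not described here. It assumes a higher-is-better metric with one score per test segment, uses paired bootstrap resampling as the significance test, and groups ranked systems greedily. The names `paired_bootstrap_p` and `cluster_systems` and the parameters `n_resamples` and `alpha` are hypothetical and chosen only for illustration.

```python
import random
from typing import Dict, List


def paired_bootstrap_p(a: List[float], b: List[float],
                       n_resamples: int = 1000, seed: int = 0) -> float:
    """Fraction of paired resamples in which system a does NOT beat system b.

    Illustrative assumption: a and b hold per-segment scores on the same
    test set, and higher scores are better.
    """
    assert len(a) == len(b) and a, "need paired, non-empty score lists"
    rng = random.Random(seed)
    n = len(a)
    not_better = 0
    for _ in range(n_resamples):
        # Resample segment indices with replacement, same indices for both.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(a[i] for i in idx) <= sum(b[i] for i in idx):
            not_better += 1
    return not_better / n_resamples


def cluster_systems(scores: Dict[str, List[float]],
                    alpha: float = 0.05) -> List[List[str]]:
    """Rank systems by mean score and group them greedily.

    A system joins the current cluster when the cluster's best system is
    not significantly better than it; otherwise it opens a new cluster.
    """
    ranked = sorted(scores,
                    key=lambda s: sum(scores[s]) / len(scores[s]),
                    reverse=True)
    clusters: List[List[str]] = []
    for name in ranked:
        if clusters:
            head = clusters[-1][0]
            if paired_bootstrap_p(scores[head], scores[name]) >= alpha:
                clusters[-1].append(name)
                continue
        clusters.append([name])
    return clusters


if __name__ == "__main__":
    # Toy per-segment scores for three hypothetical systems.
    toy = {"A": [0.9] * 20, "B": [0.9] * 20, "C": [0.1] * 20}
    print(cluster_systems(toy))
```

In this toy run, systems A and B are indistinguishable and share a cluster, while C is significantly worse and forms its own. Comparing each system only against the head of the current cluster keeps the sketch short; a real evaluation might compare all pairs or use a different test.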