Human evaluation is the gold standard for multilingual NLP, yet in practice it is often skipped and substituted with automatic metrics, because it is notoriously complex and slow to set up with existing tools, carrying substantial engineering and operational overhead. We introduce Pearmut, a lightweight yet feature-rich platform that makes end-to-end human evaluation as easy to run as automatic evaluation. Pearmut removes common entry barriers and supports the evaluation of multilingual tasks, with a particular focus on machine translation. The platform implements standard evaluation protocols, including DA, ESA, and MQM, and is also extensible to allow prototyping of new protocols. It features document-level context, absolute and contrastive evaluation, attention checks, ESAAI pre-annotations, and both static and active learning-based assignment strategies. Pearmut enables reliable human evaluation to become a practical, routine component of model development and diagnosis rather than an occasional effort.