Human evaluation is the gold standard for multilingual NLP, but in practice it is often skipped in favor of automatic metrics, because existing tools are notoriously complex and slow to set up and carry substantial engineering and operational overhead. We introduce Pearmut, a lightweight yet feature-rich platform that makes end-to-end human evaluation as easy to run as automatic evaluation. Pearmut removes common entry barriers and supports the evaluation of multilingual tasks, with a particular focus on machine translation. The platform implements standard evaluation protocols, including DA, ESA, and MQM, and can be extended with new protocols. It features document-level context, absolute and contrastive evaluation, attention checks, ESAAI pre-annotations, and both static and dynamic assignment strategies. Pearmut makes reliable human evaluation a practical, routine component of model development and diagnosis rather than an occasional effort.