We introduce JP-TL-Bench, a lightweight, open benchmark designed to guide the iterative development of Japanese-English translation systems. During iterative development, the challenge is often "which of these two good translations is better?" rather than "is this translation acceptable?" This distinction matters for Japanese-English, where subtle choices in politeness, implicature, ellipsis, and register strongly affect perceived naturalness. JP-TL-Bench uses a protocol built to make LLM judging both reliable and affordable: it evaluates a candidate model via reference-free, pairwise LLM comparisons against a fixed, versioned anchor set. Pairwise results are aggregated with a Bradley-Terry model and reported as win rates plus a normalized 0-10 "LT" score derived from a logistic transform of the fitted log-strengths. Because each candidate is scored against the same frozen anchor set, scores are structurally stable given the same anchor set, judge, and aggregation code.
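The aggregation step can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the gradient-ascent fit, the choice to center log-strengths on the anchor mean, and the exact form of the LT normalization (a logistic squashed to 0-10) are all assumptions for the sake of the example.

```python
import math

def fit_bradley_terry(wins, n_items, iters=2000, lr=0.1):
    """Fit Bradley-Terry log-strengths by gradient ascent on the
    pairwise log-likelihood. wins[i][j] = number of times item i
    beat item j in judged pairwise comparisons.
    Convention here: index 0 is the candidate, 1..n-1 are anchors."""
    theta = [0.0] * n_items
    for _ in range(iters):
        grad = [0.0] * n_items
        for i in range(n_items):
            for j in range(n_items):
                if i == j:
                    continue
                n_ij = wins[i][j] + wins[j][i]
                if n_ij == 0:
                    continue
                # P(i beats j) under Bradley-Terry
                p_ij = 1.0 / (1.0 + math.exp(theta[j] - theta[i]))
                grad[i] += wins[i][j] - n_ij * p_ij
        theta = [t + lr * g for t, g in zip(theta, grad)]
        # Strengths are identified only up to an additive constant;
        # fix the gauge by centering the anchors at zero, so the
        # candidate's theta is relative to the frozen anchor set.
        anchor_mean = sum(theta[1:]) / (n_items - 1)
        theta = [t - anchor_mean for t in theta]
    return theta

def lt_score(theta_candidate):
    """Illustrative LT score: logistic transform of the candidate's
    log-strength (relative to the anchor mean), scaled to 0-10."""
    return 10.0 / (1.0 + math.exp(-theta_candidate))
```

For example, with one candidate that beats each of three equally matched anchors 7 times out of 10, the fitted candidate strength is ln(7/3) relative to the anchors, and the LT score is 10 · σ(ln(7/3)) = 7.0. Because the anchors are frozen, re-running the fit with the same judge outcomes reproduces the same score, which is the structural-stability property the protocol relies on.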