Analytic Translation Quality Evaluation (TQE), based on Multidimensional Quality Metrics (MQM), traditionally uses a linear error-to-penalty scale calibrated to a reference sample of 1,000–2,000 words. Linear extrapolation, however, biases judgment on samples of other sizes, over-penalizing short samples and under-penalizing long ones, and thus misaligns scores with expert intuition. Building on the Multi-Range framework, this paper presents a calibrated, non-linear scoring model that better reflects how human content consumers perceive translation quality across samples of varying length. Empirical data from three large-scale enterprise environments show that acceptable error counts grow logarithmically, not linearly, with sample size. Psychophysical and cognitive evidence, including the Weber-Fechner law and Cognitive Load Theory, supports this premise: the perceptual impact of each additional error diminishes while the cognitive burden grows with scale. We propose a two-parameter model E(x) = a · ln(1 + b · x), with a, b > 0, anchored to a reference tolerance and calibrated from two tolerance points via a one-dimensional root-finding step. The model yields an explicit interval within which the linear approximation stays within ±20% relative error, and it integrates into existing evaluation workflows by adding only a dynamic tolerance function. The approach improves interpretability, fairness, and inter-rater reliability for both human and AI-generated translations. By operationalizing a perceptually valid scoring paradigm, it moves translation quality evaluation toward more accurate and scalable assessment, and it provides a stronger basis for AI-based document-level evaluation aligned with human judgment. Implementation considerations for CAT/LQA systems and implications for evaluating human and AI-generated text are discussed.
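To make the calibration step concrete, the sketch below fits E(x) = a · ln(1 + b · x) through two tolerance points by eliminating a and solving a one-dimensional equation in b, as the abstract describes. It is a minimal illustration, not the paper's implementation: the tolerance points (10 errors at 1,000 words, 25 at 10,000) are hypothetical, the bracketing root finder scipy.optimize.brentq is an implementation choice, and the closing loop is one plausible reading of the ±20% band, comparing E(x) against the naive linear extrapolation from the reference point.

```python
# Minimal sketch (not the paper's code): fit E(x) = a*ln(1 + b*x)
# through two tolerance points via a one-dimensional root find.
import math
from scipy.optimize import brentq  # bracketing 1-D root finder

def calibrate(x1, e1, x2, e2):
    """Return (a, b) such that E(x1) = e1 and E(x2) = e2.

    Eliminating a from the two equations leaves one equation in b:
        e1 * ln(1 + b*x2) - e2 * ln(1 + b*x1) = 0,
    which has a root whenever e2/e1 < x2/x1 (sub-linear growth).
    """
    g = lambda b: e1 * math.log(1 + b * x2) - e2 * math.log(1 + b * x1)
    b = brentq(g, 1e-9, 1e3)  # generous bracket; g > 0 near 0, g < 0 for large b
    a = e1 / math.log(1 + b * x1)
    return a, b

# Hypothetical tolerance points, for illustration only: 10 acceptable
# errors at a 1,000-word reference sample, 25 at 10,000 words.
x_ref, e_ref = 1_000, 10
a, b = calibrate(x_ref, e_ref, 10_000, 25)
E = lambda x: a * math.log(1 + b * x)

# One plausible reading of the +/-20% band: relative error of the naive
# linear extrapolation L(x) = e_ref * x / x_ref against the calibrated E(x).
L = lambda x: e_ref * x / x_ref
for x in (250, 500, 1_000, 2_000, 5_000):
    rel = (L(x) - E(x)) / E(x)
    print(f"x={x:>5}  E={E(x):6.2f}  L={L(x):6.2f}  rel. err = {rel:+.1%}")
```

With these illustrative points the fit gives a ≈ 7.35 and b ≈ 0.0029; the linear extrapolation matches E(x) at the reference size by construction, drifts past −20% below roughly 500 words, and past +20% above roughly 2,000 words, which is the kind of explicit validity interval the model exposes.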