Non-Linear Scoring Model for Translation Quality Evaluation

Analytic Translation Quality Evaluation (TQE), based on Multidimensional Quality Metrics (MQM), traditionally uses a linear error-to-penalty scale calibrated to a reference sample of 1,000 to 2,000 words. Linear extrapolation from that reference biases judgment on samples of other sizes: it over-penalizes short samples and under-penalizes long ones, so scores diverge from expert intuition. Building on the Multi-Range framework, this paper presents a calibrated, non-linear scoring model that better reflects how human content consumers perceive translation quality across samples of varying length. Empirical data from three large-scale enterprise environments show that acceptable error counts grow logarithmically, not linearly, with sample size. Psychophysical and cognitive evidence, including the Weber-Fechner law and Cognitive Load Theory, supports this premise by explaining why the perceptual impact of each additional error diminishes while the cognitive burden grows with scale. We propose a two-parameter model E(x) = a * ln(1 + b * x), with a, b > 0, where x is the sample size and E(x) is the acceptable error count. The model is anchored to a reference tolerance and calibrated from two tolerance points using a one-dimensional root-finding step. It yields an explicit interval within which the linear approximation stays within ±20 percent relative error, and it integrates into existing evaluation workflows with only the addition of a dynamic tolerance function. The approach improves interpretability, fairness, and inter-rater reliability across both human and AI-generated translations. By operationalizing a perceptually valid scoring paradigm, it advances translation quality evaluation toward more accurate and scalable assessment, and it provides a stronger basis for AI-based document-level evaluation aligned with human judgment. Implementation considerations for CAT/LQA systems and implications for the evaluation of human and AI-generated text are discussed.

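The calibration step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation, and it rests on assumptions that the abstract does not confirm: plain bisection stands in for the unspecified root-finding method, the linear baseline is taken to be the straight line through the origin and the reference point, and relative error is measured against E(x). The function names (`bisect`, `calibrate`, `tolerance`, `linear_band`) and the tolerance points in the usage example are hypothetical. Dividing the two calibration equations eliminates a, leaving a single equation in b: ln(1 + b * x2) / ln(1 + b * x1) = e2 / e1.

```python
import math


def bisect(f, lo, hi, rel_tol=1e-12, max_iter=500):
    """Find a root of f in [lo, hi]; f(lo) and f(hi) must differ in sign."""
    f_lo = f(lo)
    if f_lo == 0.0:
        return lo
    if f_lo * f(hi) > 0:
        raise ValueError("root is not bracketed")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_mid == 0.0:
            return mid
        if (f_mid > 0) == (f_lo > 0):
            lo = mid  # root lies in the upper half
        else:
            hi = mid  # root lies in the lower half
        if hi - lo <= rel_tol * abs(hi):
            break
    return 0.5 * (lo + hi)


def tolerance(x, a, b):
    """Acceptable error count E(x) = a * ln(1 + b * x)."""
    return a * math.log1p(b * x)


def calibrate(x1, e1, x2, e2):
    """Fit a and b so that E(x1) = e1 and E(x2) = e2 (hypothetical helper)."""
    if not (0 < x1 < x2 and 0 < e1 < e2):
        raise ValueError("need 0 < x1 < x2 and 0 < e1 < e2")
    target = e2 / e1

    def g(b):
        # Decreases from x2/x1 - target (b -> 0) to 1 - target (b -> inf).
        return math.log1p(b * x2) / math.log1p(b * x1) - target

    lo = 1e-9 / x2
    if g(lo) <= 0:
        raise ValueError("tolerances grow linearly or faster; no log fit")
    hi = 1.0 / x1
    while g(hi) > 0:  # expand until the root is bracketed
        hi *= 2.0
        if hi > 1e300:
            raise ValueError("tolerances are too flat to fit")
    b = bisect(g, lo, hi)
    a = e1 / math.log1p(b * x1)
    return a, b


def linear_band(a, b, x_ref, rel=0.2):
    """Interval where the line through (0, 0) and (x_ref, E(x_ref))
    stays within +/-rel of E(x). Assumed definition, see the lead-in."""
    slope = tolerance(x_ref, a, b) / x_ref

    def r(x):
        return slope * x / tolerance(x, a, b) - 1.0  # increasing in x

    tiny = 1e-9 * x_ref
    if r(tiny) >= -rel:
        lower = 0.0  # the line never under-shoots by more than rel
    else:
        lower = bisect(lambda x: r(x) + rel, tiny, x_ref)
    hi = x_ref
    while r(hi) < rel:
        hi *= 2.0
    upper = bisect(lambda x: r(x) - rel, x_ref, hi)
    return lower, upper


if __name__ == "__main__":
    # Hypothetical tolerance points: 5 errors at 1,000 words, 10 at 4,000.
    a, b = calibrate(1000, 5.0, 4000, 10.0)
    print(a, b, linear_band(a, b, x_ref=1000))
```

A fit exists only when the second tolerance grows more slowly than the sample size, that is, when 1 < e2 / e1 < x2 / x1; `calibrate` raises an error otherwise. In a production setting a library root finder could replace the hand-written bisection.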