In natural language processing (NLP), we typically rely on human judgement as the gold-standard method of quality evaluation. However, there is an ongoing debate on how best to assess inter-rater reliability (IRR) for certain evaluation tasks, such as translation quality evaluation (TQE), especially when data samples (observations) are very scarce. In this work, we first review how to estimate the confidence interval of a measured value when only one data (evaluation) point is available. This leads to our example with two human-generated observational scores, for which we introduce the ``Student's \textit{t}-Distribution'' method and explain how to use it to measure the IRR score from only these two data points, together with the confidence intervals (CIs) of the quality evaluation. We give a quantitative analysis of how the evaluation confidence can be greatly improved by introducing more observations, even just one extra observation. We encourage researchers to report their IRR scores by all possible means, e.g. using the Student's \textit{t}-Distribution method whenever possible, thus making NLP evaluation more meaningful, transparent, and trustworthy. The \textit{t}-Distribution method can also be used outside NLP to measure IRR levels for trustworthy evaluation of experimental investigations whenever observational data is scarce.

Keywords: Inter-Rater Reliability (IRR); Scarce Observations; Confidence Intervals (CIs); Natural Language Processing (NLP); Translation Quality Evaluation (TQE); Student's \textit{t}-Distribution
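To make the two-observation case concrete, the following is a minimal sketch of the textbook \textit{t}-interval for a mean, computed from two and then three scores. It is an illustration under stated assumptions, not necessarily the authors' exact procedure: the function names `t_critical` and `t_confidence_interval` and the example scores are hypothetical, and the critical values use the closed-form quantiles that exist only for 1 and 2 degrees of freedom (a statistics library would be needed for larger samples).

```python
import math
from statistics import mean, stdev


def t_critical(df, confidence=0.95):
    """Two-sided critical value of Student's t for df = 1 or df = 2.

    Only these two cases have the simple closed forms used here;
    this helper is a hypothetical illustration, not a general quantile function.
    """
    p = 0.5 + confidence / 2.0  # upper-tail cumulative probability, e.g. 0.975
    if df == 1:
        # With 1 degree of freedom, Student's t is the standard Cauchy distribution.
        return math.tan(math.pi * (p - 0.5))
    if df == 2:
        return (2.0 * p - 1.0) / math.sqrt(2.0 * p * (1.0 - p))
    raise ValueError("closed form implemented only for df = 1 or df = 2")


def t_confidence_interval(scores, confidence=0.95):
    """Return (lower, upper) bounds of the t-based CI for the mean of `scores`."""
    n = len(scores)
    centre = mean(scores)
    sample_sd = stdev(scores)  # sample standard deviation (n - 1 denominator)
    half_width = t_critical(n - 1, confidence) * sample_sd / math.sqrt(n)
    return centre - half_width, centre + half_width


# Hypothetical quality scores from two raters, then with one extra rater.
two_raters = [80.0, 90.0]
three_raters = [80.0, 90.0, 85.0]

for label, scores in (("two raters", two_raters), ("three raters", three_raters)):
    low, high = t_confidence_interval(scores)
    print(f"{label}: mean={mean(scores):.1f}, 95% CI=({low:.2f}, {high:.2f})")
```

With these made-up scores, the 95% half-width shrinks from roughly 63.5 points with two raters to roughly 12.4 points with three, because the critical value drops from about 12.7 to about 4.3 as the degrees of freedom go from 1 to 2. This matches the point that even one extra observation can tighten the interval considerably.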