In natural language processing (NLP), we rely on human judgement as the gold-standard method for quality evaluation. However, there is an ongoing debate on how best to estimate inter-rater reliability (IRR) for certain evaluation tasks, such as translation quality evaluation (TQE), especially when the data samples (observations) are very scarce. In this work, we first review how to estimate the confidence interval (CI) of a measured value when only one data (evaluation) point is available. This leads to our example with two human-generated observational scores, for which we introduce the ``Student's \textit{t}-Distribution'' method and explain how to use it to measure the IRR score, as well as the CIs of the quality evaluation, from these two data points alone. We give a quantitative analysis of how the evaluation confidence can be greatly improved by introducing more observations, even if only one extra observation is added. We encourage researchers to report their IRR scores by every available means, e.g. using the Student's \textit{t}-Distribution method whenever possible, thus making NLP evaluation more meaningful, transparent, and trustworthy. The \textit{t}-Distribution method can also be used outside NLP to measure the IRR level for trustworthy evaluation of experimental investigations whenever observational data are scarce.

Keywords: Inter-Rater Reliability (IRR); Scarce Observations; Confidence Intervals (CIs); Natural Language Processing (NLP); Translation Quality Evaluation (TQE); Student's \textit{t}-Distribution
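To make the two-observation case concrete, the following is a minimal Python sketch of a standard Student's \textit{t} confidence interval for the mean of two scores. It is an illustration under stated assumptions, not necessarily the exact formulation used in the paper: the function names `t_critical_df1` and `two_point_ci` and the example scores 80 and 90 are hypothetical. With two observations there is one degree of freedom, and the \textit{t}-distribution with one degree of freedom coincides with the standard Cauchy distribution, so the critical value has a closed form and no statistics library is needed.

```python
import math
import statistics


def t_critical_df1(confidence: float) -> float:
    """Two-sided critical value of Student's t with 1 degree of freedom.

    With df = 1 the t-distribution is the standard Cauchy distribution,
    whose quantile function is tan(pi * (p - 0.5)).
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    p = 0.5 + confidence / 2.0  # upper-tail quantile, e.g. 0.975 for 95%
    return math.tan(math.pi * (p - 0.5))


def two_point_ci(score_a: float, score_b: float, confidence: float = 0.95):
    """Return (mean, lower, upper) of the t-based CI for two observations.

    Assumption (illustrative): the two scores are independent draws from
    an approximately normal population of rater scores.
    """
    scores = [score_a, score_b]
    mean = statistics.mean(scores)
    sample_sd = statistics.stdev(scores)       # uses n - 1 in the denominator
    std_error = sample_sd / math.sqrt(len(scores))
    margin = t_critical_df1(confidence) * std_error
    return mean, mean - margin, mean + margin


if __name__ == "__main__":
    # Hypothetical scores from two raters on a 0-100 scale.
    mean, lower, upper = two_point_ci(80.0, 90.0)
    print(f"mean = {mean:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

For the hypothetical scores above, the 95\% critical value is about 12.71 and the standard error is 5, so the interval extends roughly 63.5 points on either side of the mean. An interval this wide shows how weakly two observations constrain the estimate, which is the motivation for adding even one more observation.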