We report our efforts to identify a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more or less reproducible. Our findings include that just 13\% of papers had (i) sufficiently low barriers to reproduction and (ii) enough obtainable information to be considered for reproduction, and that all but one of the experiments we selected for reproduction were found to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding, that the great majority of human evaluations in NLP are not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but it also presents an opportunity to rethink how human evaluations in NLP are designed and reported.