Human label variation (HLV) challenges the standard assumption that a labelled instance has a single ground truth, instead embracing the natural variation in human annotation to train and evaluate models. While various training methods and metrics for HLV have been proposed, it remains unclear which methods and metrics perform best in which settings. We propose new evaluation metrics for HLV that leverage fuzzy set theory. Because the proposed metrics are differentiable, we also experiment with using them as training objectives. We conduct an extensive study over 6 HLV datasets, testing 14 training methods and 6 evaluation metrics. We find that training on either disaggregated annotations or soft labels performs best across metrics, outperforming training with the proposed differentiable metrics as objectives. We also show that our proposed soft metric is more interpretable and correlates best with human preference.