Despite the subjective nature of semantic textual similarity (STS) and pervasive disagreements in STS annotation, existing benchmarks have used averaged human ratings as the gold standard. Averaging masks the true distribution of human opinions on examples of low agreement, and prevents models from capturing the semantic vagueness that the individual ratings represent. In this work, we introduce USTS, the first Uncertainty-aware STS dataset with ~15,000 Chinese sentence pairs and 150,000 labels, to study collective human opinions in STS. Analysis reveals that neither a scalar nor a single Gaussian fits a set of observed judgements adequately. We further show that current STS models cannot capture the variance caused by human disagreement on individual instances, but rather reflect the predictive confidence over the aggregate dataset.
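To make the point about averaging concrete, the following minimal sketch compares two hypothetical sentence pairs, each with ten ratings on an assumed 0 to 5 scale. The ratings are invented for illustration and are not drawn from USTS. The helper names `summarize` and `share_near_mean` are likewise illustrative assumptions. Both pairs receive the same averaged label, yet one reflects consensus and the other a split between two groups of annotators.

```python
from statistics import mean, pstdev

# Hypothetical ratings on an assumed 0-5 scale; NOT taken from USTS.
high_agreement = [2.5, 2.4, 2.6, 2.5, 2.5, 2.4, 2.6, 2.5, 2.5, 2.5]
low_agreement = [0.5, 0.4, 0.6, 0.5, 0.5, 4.5, 4.4, 4.6, 4.5, 4.5]


def summarize(ratings):
    """Return the (mean, population standard deviation) of a rating set."""
    return mean(ratings), pstdev(ratings)


def share_near_mean(ratings, width=1.0):
    """Fraction of ratings lying within `width` of the averaged label."""
    centre = mean(ratings)
    return sum(abs(r - centre) <= width for r in ratings) / len(ratings)


for name, ratings in [("high agreement", high_agreement),
                      ("low agreement", low_agreement)]:
    mu, sigma = summarize(ratings)
    print(f"{name}: mean={mu:.2f}, std={sigma:.2f}, "
          f"share within 1.0 of mean={share_near_mean(ratings):.1f}")
```

In the low-agreement case, no individual rating lies near the averaged label, so the scalar describes an opinion that no annotator held. A single Gaussian centred on that mean would also place most of its mass where no ratings fall, which is one way to picture why neither summary fits such judgements adequately.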