Humans tend to agree strongly on scale ratings for extreme cases (e.g., a CAT is judged as very concrete), but their judgements on mid-scale words show more disagreement. Yet collected rating norms are heavily exploited across disciplines. Our study focuses on concreteness ratings and (i) uses correlation analyses and supervised classification to identify salient multi-modal characteristics of mid-scale words, and (ii) applies hard clustering to identify patterns of systematic disagreement across raters. Our results suggest that mid-scale target words should be either fine-tuned or filtered before they are used.
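The contrast between extreme and mid-scale words can be illustrated with per-word summary statistics. The following is a minimal Python sketch, assuming that the individual ratings for each word are available on a 1-5 concreteness scale. The word list, the ratings, and the mid-scale band (a mean strictly between 2 and 4) are hypothetical values chosen for illustration. They are not the study's data or thresholds, and the toy ratings were constructed to mirror the pattern described in the abstract, so they are not evidence for it. The sketch computes each word's mean and standard deviation, separates mid-scale words from extreme ones, and keeps only the extreme words, as a simple version of the filtering option.

```python
from statistics import mean, stdev

# Hypothetical toy ratings on a 1-5 concreteness scale (1 = abstract, 5 = concrete).
# Invented for illustration only; these are not data from the study.
RATINGS = {
    "cat": [5, 5, 5, 5, 4],
    "idea": [1, 1, 2, 1, 1],
    "storm": [4, 2, 5, 3, 2],
    "fair": [2, 4, 1, 3, 5],
}

# Assumed (hypothetical) band: a word counts as mid-scale if its mean lies strictly inside it.
MID_LOW, MID_HIGH = 2.0, 4.0


def summarise(ratings):
    """Return {word: (mean rating, sample standard deviation)}."""
    return {w: (mean(r), stdev(r)) for w, r in ratings.items()}


def split_by_scale(summary, low=MID_LOW, high=MID_HIGH):
    """Partition words into (extreme, mid-scale) sets by their mean rating."""
    mid = {w for w, (m, _) in summary.items() if low < m < high}
    extreme = set(summary) - mid
    return extreme, mid


def mean_sd(summary, words):
    """Average per-word standard deviation over a set of words (a rough disagreement score)."""
    return mean(summary[w][1] for w in words)


summary = summarise(RATINGS)
extreme, mid = split_by_scale(summary)

# Filtering option: keep only words whose mean rating falls outside the mid-scale band.
filtered = {w: RATINGS[w] for w in sorted(extreme)}

print("extreme:", sorted(extreme), "mean SD = %.3f" % mean_sd(summary, extreme))
print("mid-scale:", sorted(mid), "mean SD = %.3f" % mean_sd(summary, mid))
print("kept after filtering:", list(filtered))
```

In this toy example, CAT and IDEA land at the extremes with small standard deviations, while STORM and FAIR fall in the mid-scale band with much larger ones. The standard deviation is only one possible disagreement measure. Fine-tuning instead of filtering would keep the mid-scale words and adjust or annotate their ratings.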