Several text-based labeling systems have been proposed to help monitor work on the United Nations (UN) Sustainable Development Goals (SDGs). Here, we present a systematic comparison of these systems using a variety of text sources and show that they differ considerably in their sensitivity (i.e., true-positive rate) and specificity (i.e., true-negative rate), have systematic biases (e.g., are more sensitive to some SDGs than to others), and are susceptible to the type and amount of text analyzed. We then show that an ensemble model that pools labeling systems alleviates some of these limitations and exceeds the labeling performance of all currently available systems. We conclude that researchers and policymakers should choose labeling systems with care and that ensemble methods should be favored when drawing conclusions about the absolute and relative prevalence of work on the SDGs based on automated methods.
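To make the two evaluation metrics and the idea of pooling concrete, the following minimal sketch computes the true-positive and true-negative rates for binary labels on a single SDG and combines several systems by simple majority vote. The abstract does not specify how the ensemble pools its constituent systems, so majority voting is an assumption made purely for illustration; the function names and the toy labels are likewise hypothetical and are not results from the study.

```python
from typing import List, Sequence, Tuple


def tpr_tnr(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (true-positive rate, true-negative rate) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    # Guard against documents sets with no positives or no negatives.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr


def majority_vote(system_labels: Sequence[Sequence[int]]) -> List[int]:
    """Pool per-document labels from several systems by strict majority.

    Assumption: simple voting stands in for the ensemble model here;
    the source does not describe the actual pooling method.
    """
    n_systems = len(system_labels)
    return [int(2 * sum(votes) > n_systems) for votes in zip(*system_labels)]


# Hypothetical toy data: does each of four documents relate to one SDG?
truth = [1, 1, 0, 0]
system_a = [1, 0, 0, 0]  # misses one relevant document
system_b = [1, 1, 1, 0]  # flags one irrelevant document
system_c = [0, 1, 0, 1]  # one miss and one false alarm

pooled = majority_vote([system_a, system_b, system_c])
for name, labels in [("a", system_a), ("b", system_b),
                     ("c", system_c), ("pooled", pooled)]:
    print(name, tpr_tnr(truth, labels))
```

In this toy case the individual errors do not coincide, so the vote cancels them; systems that share the same bias would not benefit in the same way.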