Automated classifiers (ACs), often built via supervised machine learning (SML), can categorize large, statistically powerful samples of data ranging from text to images and video, and have become widely popular measurement devices in communication science and related fields. Despite this popularity, even highly accurate classifiers make errors that cause misclassification bias and misleading results in downstream analyses unless those analyses account for the errors. As we show in a systematic literature review of SML applications, communication scholars largely ignore misclassification bias. In principle, existing statistical methods can use "gold standard" validation data, such as data created by human annotators, to correct misclassification bias and produce consistent estimates. We introduce and test such methods, including a new method that we design and implement in the R package misclassificationmodels, via Monte Carlo simulations designed to reveal each method's limitations; we also release these simulations. Based on our results, we recommend our new error correction method because it is versatile and efficient. In sum, automated classifiers, even those below common accuracy standards or making systematic misclassifications, can be useful for measurement when combined with careful study design and appropriate error correction methods.
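To make the idea of misclassification bias and its correction concrete, the following is a minimal simulation sketch. It is not the method proposed here and does not use the misclassificationmodels package. It assumes a binary category, a classifier whose errors do not depend on other variables, and illustrative values for prevalence, sensitivity, specificity, and sample sizes (all hypothetical). The correction shown is the classical Rogan-Gladen estimator, with sensitivity and specificity estimated from a human-annotated validation subset.

```python
import random


def run_simulation(n=50_000, n_val=5_000, prevalence=0.10,
                   sens=0.80, spec=0.90, seed=42):
    """Compare a naive and a validation-corrected prevalence estimate.

    All parameter values are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)

    # True labels (unknown in practice, except on the validation subset).
    truth = [rng.random() < prevalence for _ in range(n)]

    # Classifier output: true positives with probability `sens`,
    # false positives with probability 1 - `spec`.
    predicted = [
        (rng.random() < sens) if t else (rng.random() < 1.0 - spec)
        for t in truth
    ]

    # Naive estimate: treat classifier output as if it were the truth.
    naive = sum(predicted) / n

    # "Gold standard" validation data: the first n_val items are assumed
    # to have been annotated by humans (a random subset, since rows are i.i.d.).
    val_truth = truth[:n_val]
    val_pred = predicted[:n_val]
    n_pos = sum(val_truth)
    n_neg = n_val - n_pos
    true_pos = sum(1 for t, p in zip(val_truth, val_pred) if t and p)
    true_neg = sum(1 for t, p in zip(val_truth, val_pred) if not t and not p)
    sens_hat = true_pos / n_pos
    spec_hat = true_neg / n_neg

    # Rogan-Gladen correction; undefined if sens_hat + spec_hat == 1,
    # i.e. when the classifier is no better than chance.
    corrected = (naive + spec_hat - 1.0) / (sens_hat + spec_hat - 1.0)

    return {"true": sum(truth) / n, "naive": naive, "corrected": corrected}


if __name__ == "__main__":
    for name, value in run_simulation().items():
        print(f"{name:>9}: {value:.3f}")
```

Under these assumptions the naive estimate is pulled toward 0.10 × 0.80 + 0.90 × 0.10 = 0.17, well above the true prevalence of 0.10, while the corrected estimate recovers a value close to the truth. The price is extra variance from estimating the error rates on a finite validation sample. Regression settings and systematic (covariate-dependent) misclassification need the more general methods the abstract refers to.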