Misclassification in Automated Content Analysis Causes Bias in Regression. Can We Fix It? Yes We Can!

From arXiv. 41 pages, 21 figures. Top Paper Award from the 2023 Annual Meeting of the International Communication Association, Computational Methods Division.

Automated classifiers (ACs), often built via supervised machine learning (SML), can categorize large, statistically powerful samples of data ranging from text to images and video, and have become widely popular measurement devices in communication science and related fields. Despite this popularity, even highly accurate classifiers make errors that cause misclassification bias and misleading results in downstream analyses, unless such analyses account for these errors. As we show in a systematic literature review of SML applications, communication scholars largely ignore misclassification bias. In principle, existing statistical methods can use "gold standard" validation data, such as that created by human annotators, to correct misclassification bias and produce consistent estimates. We introduce and test such methods, including a new method that we design and implement in the R package misclassificationmodels, using Monte Carlo simulations designed to reveal each method's limitations; we also release these simulations. Based on our results, we recommend our new error correction method because it is versatile and efficient. In sum, automated classifiers, even those below common accuracy standards or making systematic misclassifications, can be useful for measurement with careful study design and appropriate error correction methods.
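The core claim above, that classifier errors bias a downstream regression and that validation data can undo the bias, can be illustrated with a small Monte Carlo sketch. This is not the paper's method and does not use the misclassificationmodels package. It assumes a binary covariate, a classifier whose errors are independent of the outcome (non-differential error), and a simple ratio correction based on validation-sample estimates. All names and parameter values (`one_replication`, `run_replications`, accuracy 0.85, true slope 2.0) are hypothetical choices for illustration.

```python
import numpy as np


def one_replication(rng, n=20000, n_val=2000, accuracy=0.85, b0=1.0, b1=2.0):
    """Simulate one dataset; return the naive and the corrected slope estimate."""
    x = rng.binomial(1, 0.5, size=n)            # true category (unobserved in practice)
    flip = rng.random(n) > accuracy             # classifier errors, independent of y
    w = np.where(flip, 1 - x, x)                # automated classification
    y = b0 + b1 * x + rng.normal(0.0, 1.0, size=n)

    # For a binary regressor, the OLS slope equals the difference in group means.
    naive = y[w == 1].mean() - y[w == 0].mean()

    # "Gold standard" validation subsample: here the true x is known.
    val = rng.choice(n, size=n_val, replace=False)
    p_x1_given_w1 = x[val][w[val] == 1].mean()  # estimate of P(X=1 | W=1)
    p_x1_given_w0 = x[val][w[val] == 0].mean()  # estimate of P(X=1 | W=0)

    # Under non-differential error, E[Y | W=w] = b0 + b1 * P(X=1 | W=w),
    # so the naive slope is b1 scaled by the difference of these probabilities.
    corrected = naive / (p_x1_given_w1 - p_x1_given_w0)
    return naive, corrected


def run_replications(reps=50, seed=0):
    """Average the naive and corrected estimates over several simulated datasets."""
    rng = np.random.default_rng(seed)
    results = np.array([one_replication(rng) for _ in range(reps)])
    return results[:, 0].mean(), results[:, 1].mean()


if __name__ == "__main__":
    naive, corrected = run_replications()
    print(f"true slope: 2.0  naive: {naive:.2f}  corrected: {corrected:.2f}")
```

With these settings the naive slope should be attenuated toward roughly 2.0 × (0.85 − 0.15) = 1.4, while the corrected estimate should land near the true value of 2.0. The ratio correction only holds under the stated assumptions. Systematic misclassification that depends on the outcome requires the more general methods the abstract refers to.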

