Supervised machine learning relies on large datasets, often with ground truth labels annotated by humans. While some data points are easy to classify, others are hard to classify, which lowers inter-annotator agreement. This introduces noise for the classifier and might affect the user's perception of the classifier's performance. In our research, we investigated whether the classification difficulty of a data point influences how strongly a prediction mistake reduces the "perceived accuracy". In an experimental online study, 225 participants interacted with three fictitious classifiers with equal accuracy (73%). The classifiers made prediction mistakes on three different types of data points (easy, difficult, impossible). After the interaction, participants judged the classifier's accuracy. We found that not all prediction mistakes reduced the perceived accuracy equally. Furthermore, the perceived accuracy differed significantly from the calculated accuracy. We conclude that accuracy and related measures seem unsuitable to represent how users perceive the performance of classifiers.
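To make the notion of "equal calculated accuracy" concrete, the following minimal sketch builds three hypothetical classifiers that each get 73 of 100 items right but make their mistakes on different difficulty types. The item counts per difficulty type, the assumption that each classifier errs on only one type, and all names (`DIFFICULTY`, `simulate`, `mistakes_by_difficulty`) are illustrative assumptions and are not taken from the study's materials.

```python
from collections import Counter

# Hypothetical test set: 100 items tagged by classification difficulty.
# These counts are illustrative assumptions, not the study's data.
DIFFICULTY = ["easy"] * 34 + ["difficult"] * 33 + ["impossible"] * 33
N_MISTAKES = 27  # 27 wrong out of 100 matches the 73% accuracy in the abstract


def simulate(mistake_type):
    """Return per-item correctness, placing all mistakes on one difficulty type."""
    remaining = N_MISTAKES
    correct = []
    for difficulty in DIFFICULTY:
        if difficulty == mistake_type and remaining > 0:
            correct.append(False)
            remaining -= 1
        else:
            correct.append(True)
    return correct


def accuracy(correct):
    """Calculated accuracy: fraction of items predicted correctly."""
    return sum(correct) / len(correct)


def mistakes_by_difficulty(correct):
    """Count the mistakes per difficulty type."""
    return Counter(d for d, ok in zip(DIFFICULTY, correct) if not ok)


results = {t: simulate(t) for t in ("easy", "difficult", "impossible")}
for mistake_type, correct in results.items():
    print(mistake_type, accuracy(correct), dict(mistakes_by_difficulty(correct)))
```

All three hypothetical classifiers report the same calculated accuracy, so the metric alone cannot distinguish a classifier that fails on easy items from one that fails only on impossible ones. The sketch does not model perceived accuracy, which the study measured through participant judgments.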