Increasing trust in new data sources: crowdsourcing image classification for ecology

Crowdsourcing methods facilitate the production of scientific information by non-experts. This form of citizen science (CS) is becoming a key source of complementary data in many fields to inform data-driven decisions and study challenging problems. However, concerns about the validity of these data often constrain their utility. In this paper, we focus on the use of citizen science data in addressing complex challenges in environmental conservation. We consider this issue from three perspectives. First, we present a literature scan of papers that have employed Bayesian models with citizen science in ecology. Second, we compare several popular majority vote algorithms and introduce a Bayesian item response model that estimates and accounts for participants' abilities after adjusting for the difficulty of the images they have classified. The model also enables participants to be clustered into groups based on ability. Third, we apply the model in a case study involving the classification of corals from underwater images from the Great Barrier Reef, Australia. We show that the model achieved superior results in general and that, for difficult tasks, a weighted consensus method that uses only groups of experts and experienced participants produced better performance measures. Moreover, we found that participants learn as they have more classification opportunities, which substantially increases their abilities over time. Overall, the paper demonstrates the feasibility of CS for answering complex and challenging ecological questions when these data are appropriately analysed. This serves as motivation for future work to increase the efficacy and trustworthiness of this emerging source of data.
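
As a rough illustration of why weighting votes by participant ability can change a consensus label, the sketch below contrasts a plain majority vote with a log-odds weighted vote. It is not the paper's Bayesian item response model: the participant names, labels, accuracy values, and function names are hypothetical, and each participant's accuracy is treated as a fixed, known number rather than being estimated jointly with image difficulty.

```python
import math
from collections import defaultdict


def majority_vote(votes):
    """Return the label chosen by the most participants.

    votes: list of (participant, label) pairs for a single image.
    """
    counts = defaultdict(float)
    for _, label in votes:
        counts[label] += 1.0
    return max(counts, key=counts.get)


def weighted_vote(votes, accuracy):
    """Return the label with the largest sum of log-odds weights.

    accuracy: dict mapping participant -> assumed probability of a
    correct classification (hypothetical values, not estimated here).
    """
    scores = defaultdict(float)
    for participant, label in votes:
        # Clip to avoid infinite weights at 0 or 1.
        p = min(max(accuracy[participant], 0.01), 0.99)
        scores[label] += math.log(p / (1.0 - p))
    return max(scores, key=scores.get)


# Hypothetical votes on one difficult image.
votes = [
    ("expert", "hard coral"),
    ("novice_1", "algae"),
    ("novice_2", "algae"),
]
accuracy = {"expert": 0.95, "novice_1": 0.55, "novice_2": 0.60}

print(majority_vote(votes))            # two novices outvote the expert
print(weighted_vote(votes, accuracy))  # the expert's vote carries more weight
```

In this toy case the two approaches disagree: the majority vote follows the two novices, while the weighted vote follows the single high-accuracy participant. A full item response model would additionally infer those abilities, and the difficulty of each image, from the classification data.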
