AI tools are increasingly deployed in community contexts. However, datasets used to evaluate AI are typically created by developers and annotators outside a given community, which can yield misleading conclusions about AI performance. How might we empower communities to drive the intentional design and curation of evaluation datasets for AI that impacts them? We investigate this question on Wikipedia, an online community with multiple AI-based content moderation tools deployed. We introduce Wikibench, a system that enables communities to collaboratively curate AI evaluation datasets, while navigating ambiguities and differences in perspective through discussion. A field study on Wikipedia shows that datasets curated using Wikibench can effectively capture community consensus, disagreement, and uncertainty. Furthermore, study participants used Wikibench to shape the overall data curation process, including refining label definitions, determining data inclusion criteria, and authoring data statements. Based on our findings, we propose future directions for systems that support community-driven data curation.