We provide a window into the construction of datasets for machine learning (ML) applications by reflecting on the process of building World Wide Dishes (WWD), an image and text dataset of culinary dishes and their associated customs from around the world. WWD takes a participatory approach to dataset creation: community members guide the design of the research process and engage in crowdsourcing efforts to build the dataset. WWD responds to calls in ML to address the limitations of web-scraped Internet datasets with curated, high-quality data that incorporates localised expertise and knowledge. Our approach supports decentralised contributions from communities that, owing to a variety of systemic factors, have not historically contributed to datasets. We contribute empirical evidence of the invisible labour of participatory design work by analysing reflections from the research team behind WWD. In doing so, we extend computer-supported cooperative work (CSCW) literature that examines the post-hoc impacts of datasets when deployed in ML applications by providing a window into the dataset construction process. We surface four dimensions of invisible labour in participatory dataset construction: building trust with community members, making participation accessible, supporting data production, and understanding the relationship between data and culture. This paper builds upon the rich participatory design literature within CSCW to guide future efforts to apply participatory design to dataset construction, so that they attend to the dynamic, collaborative, and fundamentally human processes of dataset creation.