The Human Labour of Data Work: Capturing Cultural Diversity through World Wide Dishes

Siobhan Mackenzie Hall, Samantha Dalal, Raesetje Sefala, Foutse Yuehgoh, Aisha Alaagib, Imane Hamzaoui, Shu Ishida, Jabez Magomere, Lauren Crais, Aya Salama, Tejumade Afonja

We provide a window into the process of constructing a dataset for machine learning (ML) applications by reflecting on the process of building World Wide Dishes (WWD), an image and text dataset consisting of culinary dishes and their associated customs from around the world. WWD takes a participatory approach to dataset creation: community members guide the design of the research process and engage in crowdsourcing efforts to build the dataset. WWD responds to calls in ML to address the limitations of web-scraped Internet datasets with curated, high-quality data incorporating localised expertise and knowledge. Our approach supports decentralised contributions from communities that have not historically contributed to datasets as a result of a variety of systemic factors. We contribute empirical evidence of the invisible labour of participatory design work by analysing reflections from the research team behind WWD. In doing so, we extend computer-supported cooperative work (CSCW) literature that examines the post-hoc impacts of datasets when deployed in ML applications by providing a window into the dataset construction process. We surface four dimensions of invisible labour in participatory dataset construction: building trust with community members, making participation accessible, supporting data production, and understanding the relationship between data and culture. This paper builds upon the rich participatory design literature within CSCW to guide how future efforts to apply participatory design to dataset construction can be designed in a way that attends to the dynamic, collaborative, and fundamentally human processes of dataset creation.

