The abilities of large language models (LLMs) are drawn from their pretraining data, and model development begins with data curation. However, decisions about what data is retained or removed during this initial stage are under-scrutinized. In our work, we ground web text, a popular pretraining data source, in its social and geographic contexts. We create a new dataset of 10.3 million self-descriptions of website creators, and extract information about who they are and where they are from: their topical interests, social roles, and geographic affiliations. Then, we conduct the first study investigating how ten "quality" and English language identification (langID) filters affect webpages that vary along these social dimensions. Our experiments illuminate a range of implicit preferences in data curation: we show that some quality classifiers act like topical domain filters, and langID can overlook English content from some regions of the world. Overall, we hope that our work will encourage a new line of research on pretraining data curation practices and their social implications.