The abilities of large language models (LLMs) are drawn from their pretraining data, and model development begins with data curation. However, decisions about what data is retained or removed during this initial stage are under-scrutinized. In our work, we ground web text, a popular pretraining data source, in its social and geographic contexts. We create a new dataset of 10.3 million self-descriptions of website creators and extract information about who they are and where they are from: their topical interests, social roles, and geographic affiliations. We then conduct the first study of how ten "quality" and English language identification (langID) filters affect webpages that vary along these social dimensions. Our experiments illuminate a range of implicit preferences in data curation: we show that some quality classifiers act like topical domain filters, and that langID can overlook English content from some regions of the world. Overall, we hope that our work will encourage a new line of research on pretraining data curation practices and their social implications.