As large language models (LLMs) become more advanced and impactful, it is increasingly important to scrutinize the data that they rely upon and produce. What does it mean to be a dataset practitioner doing this work? We approach this question in two parts. First, we define the role of "dataset practitioners" through a retrospective analysis of the responsibilities of teams contributing to LLM development at a technology company, Google. Second, we conduct semi-structured interviews with a cross-section of these practitioners (N=10). We find that although data quality is a top priority, there is little consensus on what data quality is or how to evaluate it. Consequently, practitioners either rely on their own intuition or write custom code to evaluate their data. We discuss potential reasons for this phenomenon and opportunities for alignment.