The Evolution of LLM Adoption in Industry Data Curation Practices

As large language models (LLMs) grow increasingly adept at processing unstructured text data, they offer new opportunities to enhance data curation workflows. This paper explores the evolution of LLM adoption among practitioners at a large technology company, evaluating the impact of LLMs on data curation tasks through participants' perceptions, integration strategies, and reported usage scenarios. Through a series of surveys, interviews, and user studies, we provide a timely snapshot of how organizations are navigating a pivotal moment in LLM evolution. In Q2 2023, we conducted a survey to assess LLM adoption in industry for development tasks (N=84), and in Q3 2023 we conducted expert interviews to assess evolving data needs (N=10). In Q2 2024, we explored practitioners' current and anticipated LLM usage through a user study involving two LLM-based prototypes (N=12). While each study addressed distinct research goals, in aggregate they revealed a broader narrative about evolving LLM usage. We discovered an emerging shift in data understanding from heuristic-first, bottom-up approaches to insights-first, top-down workflows supported by LLMs. Furthermore, to respond to a more complex data landscape, data practitioners now supplement traditional subject-expert-created 'golden datasets' with LLM-generated 'silver' datasets and rigorously validated 'super golden' datasets curated by diverse experts. This research sheds light on the transformative role of LLMs in large-scale analysis of unstructured data and highlights opportunities for further tool development.
