High-quality exploratory data analysis (EDA) is essential in the data science pipeline, but it remains highly dependent on analysts' expertise and effort. While recent LLM-based approaches partially reduce this burden, they struggle to generate effective analysis plans, appropriate insights, and suitable visualizations when user intent is abstract. Meanwhile, the vast collection of analysis notebooks produced across platforms and organizations contains rich analytical knowledge that could guide automated EDA. Retrieval-augmented generation (RAG) provides a natural way to leverage such corpora, but general-purpose methods often treat notebooks as static documents and fail to fully exploit the knowledge they contain. To address these limitations, we propose NotebookRAG, a method that takes user intent, datasets, and existing notebooks as input and retrieves, enhances, and reuses relevant notebook content for automated EDA generation. For retrieval, we transform code cells into context-enriched executable components, which improve retrieval quality and can be rerun on new data to generate updated visualizations and reliable insights. For generation, an agent leverages the enhanced retrieved content to construct effective EDA plans, derive insights, and produce appropriate visualizations. Evidence from a user study with 24 participants supports the superiority of our method in producing high-quality, intent-aligned EDA notebooks.
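To make the idea of a "context-enriched executable component" more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a notebook is available as a list of code-cell strings, and it approximates "context" as the earlier cells that assign variables the target cell reads. All names here (`Component`, `build_component`, `rerun`) are hypothetical, and the dependency tracking covers only plain variable assignments, not imports or function definitions.

```python
import ast
from dataclasses import dataclass


def names_used_and_defined(source: str) -> tuple[set[str], set[str]]:
    """Return (names read, names assigned) found anywhere in a cell's source."""
    used, defined = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                used.add(node.id)
            else:
                defined.add(node.id)
    return used, defined


@dataclass
class Component:
    """Hypothetical container: retrievable text plus runnable code."""
    description: str  # text that could be indexed for retrieval
    source: str       # target cell plus the earlier cells it depends on

    def rerun(self, **inputs):
        """Execute the component on new inputs and return the resulting namespace."""
        namespace = dict(inputs)
        exec(self.source, namespace)
        return namespace


def build_component(cells: list[str], target: int, markdown: str = "") -> Component:
    """Walk backwards from the target cell, keeping cells that define needed names."""
    needed, _ = names_used_and_defined(cells[target])
    kept = [target]
    for i in range(target - 1, -1, -1):
        used, defined = names_used_and_defined(cells[i])
        if defined & needed:
            kept.append(i)
            # Names this cell defines are resolved; names it reads become needed.
            needed = (needed - defined) | used
    source = "\n".join(cells[i] for i in sorted(kept))
    return Component(description=markdown, source=source)


# Toy notebook: the middle cell is irrelevant to the target and gets dropped.
cells = [
    "prices = [r['price'] for r in rows]",
    "unrelated = 42",
    "mean_price = sum(prices) / len(prices)",
]
component = build_component(cells, target=2, markdown="Average price")

# Rerun the same analysis on a new (assumed) dataset.
result = component.rerun(rows=[{"price": 10}, {"price": 30}])
print(result["mean_price"])  # 20.0
```

In this sketch the `description` field stands in for whatever text is indexed for retrieval, while `source` is what makes the component rerunnable on a new dataset. A real system would need sandboxed execution rather than a bare `exec`.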