While the field of NL2SQL has made significant advances in translating natural language instructions into executable SQL scripts for data querying and processing, achieving full automation within the broader data science pipeline - encompassing data querying, analysis, visualization, and reporting - remains a complex challenge. This study introduces SageCopilot, an advanced, industry-grade system that automates the data science pipeline by integrating Large Language Models (LLMs), Autonomous Agents (AutoAgents), and Language User Interfaces (LUIs). Specifically, SageCopilot incorporates a two-phase design: an online component that refines users' inputs into executable scripts through In-Context Learning (ICL) and runs the scripts for results reporting and visualization, and an offline component that prepares the demonstrations requested by ICL in the online phase. Several widely used strategies, such as Chain-of-Thought and prompt-tuning, are used to augment SageCopilot for enhanced performance. In testing and comparative analysis against prompt-based solutions on real-world datasets, SageCopilot achieves superior end-to-end performance in generating or executing scripts and in presenting results with visualization. Our in-depth ablation studies highlight the individual contributions of the components and strategies used by SageCopilot to end-to-end correctness in data science.
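The two-phase design described in the abstract can be illustrated with a minimal sketch. The abstract does not say how demonstrations are stored or selected, so everything below is an assumption made for illustration: the names `Demonstration`, `build_demo_store`, `retrieve`, and `build_prompt` are hypothetical, the token-overlap ranking stands in for whatever selection method the system actually uses, and the LLM call itself is left out.

```python
from dataclasses import dataclass


@dataclass
class Demonstration:
    question: str  # natural-language request
    script: str    # script (here SQL) assumed to answer that request


def tokenize(text):
    """Lowercase whitespace tokenization; a deliberately simple assumption."""
    return set(text.lower().split())


def build_demo_store(pairs):
    """Offline phase (sketch): turn (question, script) pairs into a demo store."""
    return [Demonstration(q, s) for q, s in pairs]


def retrieve(store, query, k=2):
    """Online phase (sketch): rank demonstrations by Jaccard token overlap."""
    q_tokens = tokenize(query)

    def score(demo):
        d_tokens = tokenize(demo.question)
        union = q_tokens | d_tokens
        return len(q_tokens & d_tokens) / len(union) if union else 0.0

    # sorted() is stable, so ties keep their original store order.
    return sorted(store, key=score, reverse=True)[:k]


def build_prompt(store, query, k=2):
    """Assemble an ICL prompt: k demonstrations, then the user's request."""
    parts = [
        f"Question: {d.question}\nScript: {d.script}"
        for d in retrieve(store, query, k)
    ]
    parts.append(f"Question: {query}\nScript:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    store = build_demo_store([
        ("total sales per region",
         "SELECT region, SUM(amount) FROM sales GROUP BY region;"),
        ("number of users per country",
         "SELECT country, COUNT(*) FROM users GROUP BY country;"),
    ])
    print(build_prompt(store, "total sales per region last year", k=1))
```

The resulting prompt would then be passed to an LLM, and the returned script executed to produce the report and visualization. Those steps depend on the deployment and are not sketched here.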