Towards Automated Data Sciences with Natural Language and SageCopilot: Practices and Lessons Learned

While the field of NL2SQL has made significant advances in translating natural language instructions into executable SQL scripts for data querying and processing, achieving full automation across the broader data science pipeline (data querying, analysis, visualization, and reporting) remains a complex challenge. This study introduces SageCopilot, an advanced, industry-grade system that automates the data science pipeline by integrating Large Language Models (LLMs), Autonomous Agents (AutoAgents), and Language User Interfaces (LUIs). Specifically, SageCopilot adopts a two-phase design: an online component refines users' inputs into executable scripts through In-Context Learning (ICL) and runs the scripts for results reporting and visualization, while an offline component prepares the demonstrations requested by ICL in the online phase. Several widely used strategies, such as Chain-of-Thought and prompt-tuning, are used to augment SageCopilot for enhanced performance. Through rigorous testing and comparative analysis against prompt-based solutions on real-world datasets, SageCopilot has been empirically validated to achieve superior end-to-end performance in generating or executing scripts and in delivering results with visualization. Our in-depth ablation studies highlight the individual contributions of the components and strategies used by SageCopilot to end-to-end correctness in data science tasks.
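
The two-phase design described above can be illustrated with a minimal sketch. It is not SageCopilot's implementation: the abstract does not specify how demonstrations are stored or selected, so the `Demonstration` type, the function names, the Jaccard token-overlap ranking, and the prompt layout are all assumptions made for illustration. The sketch stops at prompt construction and does not call an LLM.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Demonstration:
    """A question/SQL pair prepared offline (hypothetical structure)."""
    question: str
    sql: str


def tokenize(text: str) -> set[str]:
    # Lowercase whitespace tokenization; a real system would likely use embeddings.
    return set(text.lower().replace("?", " ").split())


def build_demo_store(pairs: list[tuple[str, str]]) -> list[Demonstration]:
    # Offline phase (assumed form): collect question/SQL pairs for later retrieval.
    return [Demonstration(q, s) for q, s in pairs]


def retrieve(store: list[Demonstration], question: str, k: int = 2) -> list[Demonstration]:
    # Online phase, step 1 (assumed): rank demonstrations by Jaccard token overlap.
    query = tokenize(question)

    def score(demo: Demonstration) -> float:
        tokens = tokenize(demo.question)
        union = query | tokens
        return len(query & tokens) / len(union) if union else 0.0

    return sorted(store, key=score, reverse=True)[:k]


def build_prompt(schema: str, demos: list[Demonstration], question: str) -> str:
    # Online phase, step 2 (assumed layout): schema, demonstrations, then the new question.
    parts = [f"Schema: {schema}"]
    parts += [f"Q: {d.question}\nSQL: {d.sql}" for d in demos]
    parts.append(f"Q: {question}\nSQL:")
    return "\n\n".join(parts)


# Example usage with made-up tables and queries.
store = build_demo_store([
    ("total sales per region", "SELECT region, SUM(amount) FROM sales GROUP BY region"),
    ("number of users per country", "SELECT country, COUNT(*) FROM users GROUP BY country"),
    ("average order value by month", "SELECT month, AVG(amount) FROM orders GROUP BY month"),
])
demos = retrieve(store, "total sales per month", k=2)
prompt = build_prompt("sales(region, month, amount)", demos, "total sales per month")
```

The resulting `prompt` string would then be sent to an LLM, and the returned script executed to produce the report and visualization; both steps are omitted here because the abstract gives no detail on them.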
