SQL-based exploratory data analysis (EDA) is essential to the work of data analysts. However, analysts often encounter two primary challenges: (1) crafting SQL queries skillfully, and (2) choosing visualization types that make the query results easier to interpret. Given the importance of EDA, substantial research effort has gone into approaches that address these challenges, including approaches that leverage large language models (LLMs). Existing methods nevertheless fail to meet real-world data exploration requirements, primarily because of (1) complex database schemas; (2) unclear user intent; (3) limited cross-domain generalization capability; and (4) insufficient end-to-end text-to-visualization capability. This paper presents TiInsight, an automated SQL-based cross-domain exploratory data analysis system. First, we propose the hierarchical data context (HDC), which leverages LLMs to summarize the contexts related to the database schema; this summary is crucial for open-world EDA systems to generalize across data domains. Second, the EDA system is divided into four components (i.e., stages): HDC generation, question clarification and decomposition, text-to-SQL generation (TiSQL), and data visualization (TiChart). Finally, we implemented an end-to-end EDA system with a user-friendly GUI in the production environment at PingCAP. We have also open-sourced all APIs of TiInsight to facilitate research within the EDA community. Through an extensive evaluation based on a real-world user study, we demonstrate that TiInsight performs remarkably well compared with human experts. Specifically, TiSQL achieves an execution accuracy of 86.3% on the Spider dataset using GPT-4. It also demonstrates state-of-the-art performance on the Bird dataset.
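The execution accuracy figure quoted for TiSQL refers to a metric that is commonly computed by running each predicted query and its reference query against the same database and checking whether the results agree. The following is a minimal sketch of that idea, not the evaluation code used for TiInsight or the official Spider script: it assumes an in-memory SQLite database, compares results as unordered multisets of rows, and counts a query that fails to execute as a miss. The function names, the `orders` table, and the sample queries are all hypothetical.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """Return True if both queries yield the same multiset of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A predicted query that cannot be executed counts as a miss.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    # Row order is ignored; duplicate rows are still counted.
    return Counter(predicted_rows) == Counter(gold_rows)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) query pairs whose results match."""
    if not pairs:
        return 0.0
    hits = sum(execution_match(conn, pred, gold) for pred, gold in pairs)
    return hits / len(pairs)


def build_demo():
    """Create a tiny hypothetical database and four query pairs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
        INSERT INTO orders (region, amount)
        VALUES ('EU', 10.0), ('EU', 5.0), ('US', 7.5);
        """
    )
    pairs = [
        # Different SQL text, same result: counts as correct.
        ("SELECT COUNT(*) FROM orders WHERE region = 'EU'",
         "SELECT COUNT(id) FROM orders WHERE region IN ('EU')"),
        # Runs, but returns a different result: counts as wrong.
        ("SELECT COUNT(*) FROM orders",
         "SELECT COUNT(*) FROM orders WHERE region = 'US'"),
        # References a table that does not exist: counts as wrong.
        ("SELECT amount FROM order_items",
         "SELECT amount FROM orders"),
        # Different predicate, same rows on this data: counts as correct.
        ("SELECT region FROM orders WHERE amount > 6",
         "SELECT region FROM orders WHERE amount >= 7.5"),
    ]
    return conn, pairs


if __name__ == "__main__":
    conn, pairs = build_demo()
    print(f"execution accuracy: {execution_accuracy(conn, pairs):.2f}")
```

Comparing results rather than SQL text is the point of the metric: the first and last pairs differ syntactically yet are scored as correct. The last pair also shows a known weakness, since two queries can agree on one database instance without being equivalent in general.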