Data exploration is a challenging process in which users examine a dataset by iteratively issuing a series of queries. While in some cases the user explores a new dataset to become familiar with it, more often the exploration is conducted with a specific analysis goal or question in mind. To assist users in exploring a new dataset, previous work has devised Automated Data Exploration (ADE) systems. These systems aim to auto-generate a full exploration session, containing a sequence of queries that showcase interesting elements of the data. However, existing ADE systems are often constrained by a predefined objective function, and thus always generate the same session for a given dataset. Their effectiveness in goal-oriented exploration, in which users need to answer specific questions about the data, is therefore extremely limited. To address this limitation, this paper presents LINX, a generative system augmented with a natural language interface for goal-oriented ADE. Given an input dataset and an analytical goal described in natural language, LINX generates a personalized exploratory session that is relevant to the user's goal. LINX utilizes a Large Language Model (LLM) to interpret the input analysis goal and then derive a set of specifications for the desired output exploration session. These specifications are then passed to a novel, modular ADE engine based on Constrained Deep Reinforcement Learning (CDRL), which can adapt its output according to the specified instructions. To validate LINX's effectiveness, we introduce a new benchmark dataset for goal-oriented exploration and conduct an extensive user study. Our analysis underscores LINX's superior capability in producing exploratory notebooks that are significantly more relevant and beneficial than those generated by existing solutions, including ChatGPT, goal-agnostic ADE, and commercial systems.