Automating end-to-end Exploratory Data Analysis (AutoEDA) is a challenging open problem, often tackled with Reinforcement Learning (RL) by learning to predict a sequence of analysis operations (FILTER, GROUP, etc.). Defining a reward for each operation is difficult, and existing methods rely on various \emph{interestingness measures} to craft reward functions that capture the importance of each operation. In this work, we argue that not all of the essential features that make an operation important can be captured accurately by a mathematically defined reward. We propose an AutoEDA model trained through imitation learning from expert EDA sessions, bypassing the need for manually defined interestingness measures. Our method, based on generative adversarial imitation learning (GAIL), generalizes well across datasets, even with limited expert data. We also introduce a novel approach for generating synthetic EDA demonstrations for training. Our method outperforms the existing state-of-the-art end-to-end EDA approach on benchmarks by up to 3x, showing strong performance and generalization, while naturally capturing diverse interestingness measures in the generated EDA sessions.