Relational tables, where each row corresponds to an entity and each column corresponds to an attribute, have been the standard for tables in relational databases. However, such a standard cannot be taken for granted when dealing with tables "in the wild". Our survey of real spreadsheet-tables and web-tables shows that over 30% of such tables do not conform to the relational standard, for which complex table-restructuring transformations are needed before these tables can be queried easily using SQL-based analytics tools. Unfortunately, the required transformations are non-trivial to program, which has become a substantial pain point for technical and non-technical users alike, as evidenced by large numbers of forum questions in places like StackOverflow and Excel/Power-BI/Tableau forums. We develop an Auto-Tables system that can automatically synthesize pipelines with multi-step transformations (in Python or other languages), to transform non-relational tables into standard relational forms for downstream analytics, obviating the need for users to manually program transformations. We compile an extensive benchmark for this new task, by collecting 244 real test cases from user spreadsheets and online forums. Our evaluation suggests that Auto-Tables can successfully synthesize transformations for over 70% of test cases at interactive speeds, without requiring any input from users, making this an effective tool for both technical and non-technical users to prepare data for analytics.
翻译:摘要:关系表(每行对应一个实体,每列对应一个属性)是关系数据库中表格的标准形式。然而,在处理“自然存在”的表格时,这一标准并不总是成立。我们对真实电子表格和网络表格的调查表明,超过30%的此类表格不符合关系标准,因此在能够使用基于SQL的分析工具轻松查询这些表格之前,需要复杂的表格重构转换。遗憾的是,所需的转换编程难度较大,这已成为技术用户和非技术用户共同的显著痛点,这在StackOverflow、Excel/Power-BI/Tableau论坛等平台的大量提问中可见一斑。我们开发了Auto-Tables系统,该系统可自动合成包含多步转换的流程(使用Python或其他语言),将非关系表转换为下游分析所需的标准关系形式,从而免除用户手动编程转换的需求。我们为此新任务编译了一个广泛的基准测试集,从用户电子表格和在线论坛中收集了244个真实测试用例。评估结果表明,Auto-Tables能够以交互式速度成功为超过70%的测试用例合成转换,且无需用户任何输入,这使其成为技术用户和非技术用户准备分析数据的有效工具。