Relational tables, in which each row corresponds to an entity and each column to an attribute, are the standard form for tables in relational databases. However, this form cannot be taken for granted when dealing with tables "in the wild". Our survey of real spreadsheet tables and web tables shows that over 30% of them do not conform to the relational standard and require complex table-restructuring transformations before they can be queried easily with SQL-based analytics tools. Unfortunately, these transformations are non-trivial to program, which has become a substantial pain point for technical and non-technical users alike, as evidenced by the large number of forum questions on sites such as StackOverflow and the Excel and Tableau forums. We develop Auto-Tables, a system that automatically synthesizes pipelines of multi-step transformations (in Python or other languages) to convert non-relational tables into standard relational forms for downstream analytics, obviating the need for users to program the transformations manually. We compile an extensive benchmark for this new task by collecting 194 real test cases from user spreadsheets and online forums. Our evaluation suggests that Auto-Tables can successfully synthesize transformations for over 70% of test cases at interactive speeds, without requiring any input from users, making it an effective tool for both technical and non-technical users to prepare data for analytics.
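To make the idea of a table-restructuring transformation concrete, the following is a minimal hand-written sketch of one such step: an "unpivot" that turns a wide table with one column per year into a relational table with one row per (product, year) pair. It is an illustration only, not output from Auto-Tables or a description of its operators; the `unpivot` helper, the sample table, and the column names `product`, `year`, and `sales` are all assumptions made up for this example.

```python
from typing import Any

Row = dict[str, Any]


def unpivot(rows: list[Row], id_columns: list[str],
            var_name: str, value_name: str) -> list[Row]:
    """Emit one output row per non-id cell of each input row.

    Hypothetical helper for illustration: the old column header goes
    into `var_name` and the cell value goes into `value_name`.
    """
    relational: list[Row] = []
    for row in rows:
        for column, value in row.items():
            if column in id_columns:
                continue  # id columns are copied, not unpivoted
            new_row = {key: row[key] for key in id_columns}
            new_row[var_name] = column
            new_row[value_name] = value
            relational.append(new_row)
    return relational


# Assumed "wide" spreadsheet layout: one column per year.
wide_table = [
    {"product": "A", "2021": 10, "2022": 12},
    {"product": "B", "2021": 7, "2022": 9},
]

# Relational layout: each row is one (product, year) observation.
relational_table = unpivot(wide_table, ["product"], "year", "sales")
for row in relational_table:
    print(row)
```

In practice the same step is often written with a dataframe library, for example pandas' `DataFrame.melt`; the plain-Python version is used here only to keep the sketch dependency-free. A real non-relational table may need several such steps chained together, which is the kind of pipeline the abstract describes synthesizing automatically.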