Tabular data is a crucial form of data representation and exists in diverse formats on the Web. When tables are complex and irregular, modifying them manually is laborious. This paper investigates the performance of Large Language Models (LLMs) on table editing tasks. Existing research focuses mainly on regularly shaped tables, where instructions are used to generate code in SQL, Python, or Excel Office Scripts that manipulates the table. However, tables with irregular structures, particularly those containing merged cells that span multiple rows, are difficult to edit with code. To address this, we introduce the WikiTableEdit dataset. Using 26,531 tables from the WikiSQL dataset, we automatically generate natural language instructions for six distinct basic operations, together with the corresponding outcomes, resulting in over 200,000 instances. We then evaluate several representative large language models on WikiTableEdit to demonstrate how challenging the task is. The dataset will be released to the community to promote related research.