Missing data in tabular datasets is a common issue, as the performance of downstream tasks usually depends on the completeness of the training data. Previous missing-data imputation methods focus on numeric and categorical columns; we propose a novel end-to-end approach, Table Transformers for Imputing Textual Attributes (TTITA), based on the transformer architecture, which imputes unstructured textual columns using the other columns in the table. We conduct extensive experiments on three datasets, and our approach shows competitive performance, outperforming baseline models such as recurrent neural networks and Llama2. The performance improvement is more significant when the target sequence is longer. Additionally, we incorporate multi-task learning to simultaneously impute heterogeneous columns, further boosting text imputation performance. We also qualitatively compare with ChatGPT in realistic applications.
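The core idea, encoding a row's observed heterogeneous columns and decoding the missing text with a transformer decoder, can be sketched as follows. This is a minimal hypothetical illustration in PyTorch, not the paper's actual architecture: the module name, the per-column encoders, and all hyperparameters are assumptions for exposition.

```python
import torch
import torch.nn as nn


class TabularTextImputer(nn.Module):
    """Hypothetical sketch: encode a row's numeric and categorical columns
    into a memory sequence, then decode the missing textual column with a
    transformer decoder that cross-attends to that memory."""

    def __init__(self, vocab_size, num_numeric, num_categories, d_model=64):
        super().__init__()
        # One memory vector per column group (illustrative choice).
        self.num_proj = nn.Linear(num_numeric, d_model)       # numeric columns
        self.cat_emb = nn.Embedding(num_categories, d_model)  # one categorical column
        self.tok_emb = nn.Embedding(vocab_size, d_model)      # target-text tokens
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, numeric, categorical, tgt_tokens):
        # Memory: stack column encodings as a length-2 sequence per row.
        mem = torch.stack([self.num_proj(numeric), self.cat_emb(categorical)], dim=1)
        tgt = self.tok_emb(tgt_tokens)
        # Causal mask so each position attends only to earlier target tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tgt_tokens.size(1))
        h = self.decoder(tgt, mem, tgt_mask=mask)
        return self.out(h)  # per-position logits over the vocabulary


model = TabularTextImputer(vocab_size=100, num_numeric=3, num_categories=5)
logits = model(
    torch.randn(2, 3),                  # 2 rows, 3 numeric columns
    torch.tensor([1, 4]),               # categorical column per row
    torch.randint(0, 100, (2, 7)),      # shifted target text, length 7
)
print(logits.shape)  # torch.Size([2, 7, 100])
```

Training such a model with teacher forcing and cross-entropy over the vocabulary, and extending the decoder head with extra outputs for other columns, is one way the multi-task variant described above could be realized.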