Efficient processing of tabular data is important in many industries, especially when working with datasets that contain a large number of columns. Large language models (LLMs) have demonstrated their ability on several tasks when given carefully crafted prompts. However, creating effective prompts for tabular datasets is challenging because of the structured nature of the data and the need to manage numerous columns. This paper presents an auto-prompt generation system that is suitable for multiple LLMs and requires minimal training. It proposes two novel methods: (1) a reinforcement learning-based algorithm for identifying and sequencing task-relevant columns, and (2) a cell-level similarity-based approach for enhancing few-shot example selection. Our approach has been tested across 66 datasets and shows improved performance on three downstream tasks (data imputation, error detection, and entity matching) using two distinct LLMs: Google flan-t5-xxl and Mixtral 8x7B.
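To make the second idea more concrete, the following is a minimal sketch of cell-level similarity for few-shot example selection. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the similarity measure, so token-level Jaccard overlap is swapped in here, and the names `cell_similarity`, `row_similarity`, and `select_few_shot` are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]


def cell_similarity(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase token sets of two cell values.

    Assumption: Jaccard is a stand-in; the abstract does not name the measure.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty cells are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def row_similarity(query: Row, candidate: Row, columns: List[str]) -> float:
    """Average the cell-level similarity over the given columns."""
    if not columns:
        return 0.0
    scores = [
        cell_similarity(query.get(col, ""), candidate.get(col, ""))
        for col in columns
    ]
    return sum(scores) / len(scores)


def select_few_shot(
    query: Row, pool: List[Row], columns: List[str], k: int = 2
) -> List[Row]:
    """Return the k rows from the pool most similar to the query, best first."""
    ranked = sorted(
        pool,
        key=lambda row: row_similarity(query, row, columns),
        reverse=True,
    )
    return ranked[:k]
```

In this sketch, `columns` would be the task-relevant columns chosen by a separate step (in the paper, the reinforcement learning-based selector), and the returned rows would be serialized into the prompt as few-shot examples ahead of the query row.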