Acquiring high-quality data is often a significant challenge in training machine learning (ML) models for tabular prediction, particularly in privacy-sensitive and costly domains like medicine and finance. Providing natural language instructions to large language models (LLMs) offers an alternative. However, it is unclear how effectively instructions leverage the knowledge in LLMs for solving tabular prediction problems. To address this gap, we introduce TABLET, a benchmark of 20 diverse tabular datasets annotated with instructions that vary in their phrasing, granularity, and technicality. TABLET also includes the instructions' logic and structured modifications to the instructions. We find that in-context instructions increase zero-shot F1 performance on TABLET by 44% on average for Flan-T5 11b and by 13% for ChatGPT. We also explore the limitations of using LLMs for tabular prediction in our benchmark by evaluating instruction faithfulness. We find that LLMs often ignore instructions and fail to predict specific instances correctly, even with examples. Our analysis of TABLET shows that, while instructions improve LLM performance, learning from instructions for tabular data requires new capabilities.
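As a rough illustration of the setup described above, the sketch below shows one way to pair a natural language instruction with a serialized table row to form a zero-shot prompt, and how F1 could be computed over the resulting predictions. The prompt layout, the function names `build_prompt` and `f1_score`, and the example instruction and row are all assumptions made for illustration; they are not taken from TABLET, and the call to an actual LLM is left out.

```python
from typing import Mapping, Sequence


def build_prompt(instructions: str, row: Mapping[str, object], labels: Sequence[str]) -> str:
    """Join task instructions with one serialized table row.

    The layout here is a hypothetical format, not the benchmark's own.
    """
    # Serialize each column as "name: value", one per line.
    features = "\n".join(f"{name}: {value}" for name, value in row.items())
    return (
        f"{instructions}\n\n"
        f"{features}\n\n"
        f"Answer with one of: {', '.join(labels)}.\nAnswer:"
    )


def f1_score(y_true: Sequence[str], y_pred: Sequence[str], positive: str) -> float:
    """Binary F1 for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Invented instruction and row, for illustration only.
    prompt = build_prompt(
        "Predict whether the patient is at high risk. Older patients with chest pain are more likely to be high risk.",
        {"age": 67, "chest_pain": "yes"},
        ["high risk", "low risk"],
    )
    print(prompt)
    # Invented labels and predictions standing in for model outputs.
    print(f1_score(["yes", "no", "yes", "no"], ["yes", "yes", "no", "no"], positive="yes"))
```

In a real evaluation, each prompt would be sent to the model and its answer mapped back to one of the labels before scoring.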