We study the application of large language models to zero-shot and few-shot classification of tabular data. We prompt the large language model with a serialization of the tabular data to a natural-language string, together with a short description of the classification problem. In the few-shot setting, we fine-tune the large language model using some labeled examples. We evaluate several serialization methods including templates, table-to-text models, and large language models. Despite its simplicity, we find that this technique outperforms prior deep-learning-based tabular classification methods on several benchmark datasets. In most cases, even zero-shot classification obtains non-trivial performance, illustrating the method's ability to exploit prior knowledge encoded in large language models. Unlike many deep learning methods for tabular datasets, this approach is also competitive with strong traditional baselines like gradient-boosted trees, especially in the very-few-shot setting.
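As a minimal sketch of the template-based serialization described above, the following turns one table row into a natural-language string and appends a short task description to form a prompt. The function names, the "The <column> is <value>." sentence pattern, the example columns, the question wording, and the trailing "Answer:" cue are all illustrative assumptions rather than the exact setup used in the study.

```python
from typing import Mapping


def serialize_row(row: Mapping[str, object]) -> str:
    """Serialize one table row as sentences of the form 'The <column> is <value>.'

    Columns whose value is None are skipped (an assumption made for this sketch).
    """
    return " ".join(
        f"The {column} is {value}."
        for column, value in row.items()
        if value is not None
    )


def build_prompt(row: Mapping[str, object], task_description: str) -> str:
    """Combine the serialized row with a short description of the classification task."""
    return f"{serialize_row(row)}\n\n{task_description}\nAnswer:"


if __name__ == "__main__":
    # Hypothetical row and question, for illustration only.
    example_row = {"age": 42, "education": "Masters", "occupation": None}
    question = "Does this person earn more than 50000 dollars per year? Yes or no?"
    print(build_prompt(example_row, question))
```

In the zero-shot setting such a prompt would be passed to the language model as-is; in the few-shot setting, prompts built this way from labeled rows would serve as fine-tuning examples.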