Classifying scanned documents is a challenging problem that involves image, layout, and text analysis for document understanding. Nevertheless, on certain benchmark datasets, notably RVL-CDIP, the state of the art is approaching near-perfect performance when hundreds of thousands of training samples are available. With the advent of large language models (LLMs), which are excellent few-shot learners, the question arises as to what extent the document classification problem can be addressed with only a few training samples, or even none at all. In this paper, we investigate this question in the context of zero-shot prompting and few-shot model fine-tuning, with the aim of reducing the need for human-annotated training samples as much as possible.
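To illustrate the zero-shot setting mentioned above, a scanned document's OCR text can be classified with a single prompt and no labeled examples. The sketch below is an assumption, not the paper's actual method: `complete` is a placeholder for any LLM completion API, and the label list is a small subset of the RVL-CDIP classes.

```python
# Zero-shot document classification via prompting: a minimal sketch.
# `complete` is a stand-in for any LLM completion function (an assumption,
# not a specific library); in practice it would call a hosted model.

LABELS = ["letter", "email", "invoice", "resume", "memo"]  # subset of RVL-CDIP classes


def build_prompt(ocr_text, labels=LABELS):
    """Build a zero-shot classification prompt from a document's OCR text."""
    options = ", ".join(labels)
    return (
        "Classify the following scanned document into exactly one of "
        f"these categories: {options}.\n\n"
        f"Document text:\n{ocr_text}\n\n"
        "Answer with the category name only."
    )


def parse_label(completion, labels=LABELS):
    """Map the model's free-text answer back onto the closed label set."""
    answer = completion.strip().lower()
    for label in labels:
        if label in answer:
            return label
    return None  # model answered outside the label set


def classify(ocr_text, complete):
    """Zero-shot classify: prompt the LLM once and parse its answer."""
    return parse_label(complete(build_prompt(ocr_text)))
```

Because the label set is closed, the free-text completion is mapped back onto a known category, which is what makes a prompting approach comparable to a trained classifier in this setting.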