Standard full-data classifiers in NLP demand thousands of labeled examples, which is impractical in data-limited domains. Few-shot methods offer an alternative, using contrastive learning techniques that can be effective with as few as 20 examples per class. Similarly, Large Language Models (LLMs) like GPT-4 can perform effectively with just 1-5 examples per class. However, the performance-cost trade-offs of these methods remain underexplored, a critical concern for budget-limited organizations. Our work addresses this gap by studying these approaches on the Banking77 financial intent detection dataset, including an evaluation of cutting-edge LLMs from OpenAI, Cohere, and Anthropic across a comprehensive set of few-shot scenarios. We complete the picture with two additional methods: first, a cost-effective querying method for LLMs based on retrieval-augmented generation (RAG), which can reduce operational costs severalfold compared to classic few-shot approaches; and second, a data augmentation method using GPT-4, which can improve performance in data-limited scenarios. Finally, to inspire future research, we provide a subset of Banking77 curated by a human expert, along with extensive error analysis.