Task Contamination: Language Models May Not Be Few-Shot Anymore

Large language models (LLMs) offer impressive performance in various zero-shot and few-shot tasks. However, their success in zero-shot and few-shot settings may be affected by task contamination, a potential limitation that has not been thoroughly examined. This paper investigates how the zero-shot and few-shot performance of LLMs has changed over time. Utilizing GPT-3 series models and several other recent open-sourced LLMs, and controlling for dataset difficulty, we find that LLMs perform surprisingly better on datasets released before the LLM training data creation date than on datasets released after it. This strongly indicates that, for many LLMs, there exists task contamination on zero-shot and few-shot evaluation for datasets released prior to the LLMs' training data creation date. Additionally, we utilize training data inspection, task example extraction, and a membership inference attack, which reveal further evidence of task contamination. Importantly, we find that for classification tasks with no possibility of task contamination, LLMs rarely demonstrate statistically significant improvements over simple majority baselines, in both zero-shot and few-shot settings.
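To make the comparison against a majority baseline concrete, the following is a minimal sketch of one way to check whether a classifier's accuracy is significantly above that baseline. The abstract does not say which statistical test the authors use, so the one-sided exact binomial test here is an assumption, as are the function names, the 0.05 threshold, and the choice to compute the baseline from the evaluation labels.

```python
from collections import Counter
from math import comb


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def binomial_p_value(num_correct, n, p0):
    """One-sided exact binomial test: P(X >= num_correct), X ~ Binomial(n, p0)."""
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(num_correct, n + 1)
    )


def compare_to_majority_baseline(predictions, labels, alpha=0.05):
    """Return (accuracy, baseline accuracy, p-value, significant?).

    Assumption: the baseline is computed on the same labels that the
    model is evaluated on; the paper may define it differently.
    """
    n = len(labels)
    num_correct = sum(p == y for p, y in zip(predictions, labels))
    p0 = majority_baseline_accuracy(labels)
    p_value = binomial_p_value(num_correct, n, p0)
    return num_correct / n, p0, p_value, p_value < alpha


if __name__ == "__main__":
    # Hypothetical toy data: 60 positive and 40 negative examples.
    labels = ["pos"] * 60 + ["neg"] * 40
    # Hypothetical model output: correct on the first 65 examples only.
    predictions = labels[:65] + ["pos"] * 35
    acc, base, p, significant = compare_to_majority_baseline(predictions, labels)
    print(f"accuracy={acc:.2f} baseline={base:.2f} p={p:.3f} significant={significant}")
```

In this toy case the model is five points above the baseline on 100 examples, which the test does not treat as significant at the 0.05 level. The example illustrates why a raw accuracy gap on a small evaluation set is not enough to claim an improvement.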


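Related topic: few-shot learning

Few-shot learning (FSL) addresses how to improve a neural network's performance when only a small amount of data is available. One class of methods commonly used in FSL is meta-learning. Like ordinary neural network training, meta-learning has a training phase and a testing phase, which are called meta-training and meta-testing, respectively.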
