Large language models (LLMs) offer impressive performance on a variety of zero-shot and few-shot tasks. However, their success in these settings may be affected by task contamination, a potential limitation that has not been thoroughly examined. This paper investigates how the zero-shot and few-shot performance of LLMs has changed over time. Using GPT-3 series models and several other recent open-source LLMs, and controlling for dataset difficulty, we find that LLMs perform surprisingly better on datasets released before their training data creation date than on datasets released after it. This strongly indicates that, for many LLMs, zero-shot and few-shot evaluations are affected by task contamination on datasets released prior to the LLMs' training data creation date. Additionally, we use training data inspection, task example extraction, and a membership inference attack, all of which reveal further evidence of task contamination. Importantly, we find that for classification tasks with no possibility of task contamination, LLMs rarely demonstrate statistically significant improvements over simple majority baselines, in both zero-shot and few-shot settings.
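To make the comparison against a majority baseline concrete, the following is a minimal sketch of how one might check whether a classifier's accuracy is a statistically significant improvement over always predicting the most frequent label. The function names, the exact one-sided binomial test, and the 0.05 threshold are assumptions chosen for illustration; the abstract does not specify which significance test the authors used.

```python
from collections import Counter
from math import comb


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def binomial_p_value(num_correct, n, p0):
    """One-sided exact binomial test: P(X >= num_correct), X ~ Binomial(n, p0)."""
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(num_correct, n + 1)
    )


def beats_majority_baseline(predictions, labels, alpha=0.05):
    """True if accuracy is significantly above the majority baseline.

    Assumption: the test choice and alpha are illustrative, not the paper's.
    """
    n = len(labels)
    p0 = majority_baseline_accuracy(labels)
    num_correct = sum(p == y for p, y in zip(predictions, labels))
    return binomial_p_value(num_correct, n, p0) < alpha


# Toy data, invented purely for illustration.
labels = ["pos"] * 6 + ["neg"] * 4
always_majority = ["pos"] * 10
print(majority_baseline_accuracy(labels))
print(beats_majority_baseline(always_majority, labels))
```

A model that only reproduces the majority label matches the baseline by construction, so the test should not flag it as an improvement; only accuracy well above the baseline rate, relative to the sample size, yields a small p-value.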