Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.