Language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains: varying distributions of language. Rather than assuming perplexity on one distribution extrapolates to others, Perplexity Analysis for Language Model Assessment (Paloma) measures LM fit to 585 text domains, ranging from nytimes.com to r/depression on Reddit. We invite submissions to our benchmark and organize results by comparability based on compliance with guidelines such as removal of benchmark contamination from pretraining. Submissions can also record parameter and training token counts, enabling comparisons of Pareto efficiency with performance as a function of these measures of cost. We populate our benchmark with results from 6 baselines pretrained on popular corpora. In case studies, we demonstrate analyses that are possible with Paloma, such as finding that pretraining without data beyond Common Crawl leads to inconsistent fit to many domains.