Perplexity, a function measuring a model's overall "surprise" when encountering a particular output, has gained significant traction in recent years, both as a loss function and as an easy-to-compute metric of model quality. Prior studies have pointed out several limitations of perplexity, typically from an empirical standpoint. Here we leverage recent results on Transformer continuity to show rigorously that perplexity may be an unsuitable metric for model selection. Specifically, we prove that if there is any sequence that a compact decoder-only Transformer model predicts accurately and confidently (a necessary prerequisite for strong generalisation), then there must exist another sequence with very low perplexity that the same model does not predict correctly. Further, by analytically studying iso-perplexity plots, we find that perplexity does not always select the more accurate model: rather, any increase in model confidence must be accompanied by a commensurate rise in accuracy for the new model to be selected.
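For concreteness, the standard definition of perplexity used throughout, the exponential of the mean negative log-likelihood the model assigns to a token sequence, can be sketched as follows (function and variable names here are illustrative, not from the paper):

```python
import math

def perplexity(token_log_probs):
    """Perplexity: exp of the mean negative log-likelihood over a sequence.

    token_log_probs: natural-log probabilities the model assigned to each
    observed token. Lower perplexity means the model was less "surprised".
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that assigns probability 0.5 to every token has perplexity 2.
print(perplexity([math.log(0.5)] * 4))  # 2.0

# A more confident model (p = 0.9 per token) scores lower perplexity,
# regardless of whether its argmax predictions are actually correct --
# the decoupling of confidence and accuracy examined in the paper.
print(perplexity([math.log(0.9)] * 4))
```

Note that perplexity depends only on the probability assigned to the observed tokens, which is why a low-perplexity sequence need not be one the model predicts correctly.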