With the rise of Large Language Models (LLMs) and their ubiquitous deployment in diverse domains, measuring language model behavior on realistic data is imperative. For example, a company deploying a client-facing chatbot must ensure that the model will not respond to client requests with profanity. Current evaluations approach this problem using small, domain-specific datasets with human-curated labels. These evaluation sets are often sampled from a narrow and simplified distribution, and their data sources can inadvertently leak into the training set, which can lead to misleading evaluations. To bypass these drawbacks, we propose a framework for self-supervised evaluation of LLMs that analyzes their sensitivity or invariance to transformations of the input text. Self-supervised evaluation can directly monitor LLM behavior on datasets collected in the wild or streamed during live model deployment. We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence, in addition to sensitivity to grammatical structure and tokenization errors. When comparisons to similar human-labeled benchmarks are available, we find strong correlations between self-supervised and human-supervised evaluations. The self-supervised paradigm complements current evaluation strategies that rely on labeled data.
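The idea of measuring sensitivity or invariance to an input transformation can be illustrated with a minimal sketch. This is not the framework's actual metric: the names `sensitivity`, `reverse_words`, `bag_of_words_score`, and `bigram_score` are hypothetical, and the two scorers are toy stand-ins for a quantity that would in practice be read from an LLM (for example, perplexity or the probability of a toxic continuation). No labels are needed; only pairs of original and transformed text.

```python
from statistics import mean
from typing import Callable, Iterable


def sensitivity(score_fn: Callable[[str], float],
                transform: Callable[[str], str],
                texts: Iterable[str]) -> float:
    """Mean absolute change in score_fn when transform is applied to each text.

    A value of 0.0 means the scorer is invariant to the transformation
    on this corpus. (Illustrative definition, assumed for this sketch.)
    """
    pairs = [(t, transform(t)) for t in texts]
    if not pairs:
        raise ValueError("need at least one text")
    return mean(abs(score_fn(new) - score_fn(old)) for old, new in pairs)


def reverse_words(text: str) -> str:
    """Hypothetical transformation: reverse the word order."""
    return " ".join(reversed(text.split()))


def bag_of_words_score(text: str) -> float:
    """Toy stand-in for a model output: fraction of words in a small vocabulary."""
    vocab = {"the", "cat", "sat"}
    words = text.lower().split()
    return sum(w in vocab for w in words) / len(words) if words else 0.0


def bigram_score(text: str) -> float:
    """Toy order-aware stand-in: fraction of adjacent word pairs in a known set."""
    known = {("the", "cat"), ("cat", "sat")}
    words = text.lower().split()
    pairs = list(zip(words, words[1:]))
    return sum(p in known for p in pairs) / len(pairs) if pairs else 0.0


# Unlabeled corpus: no human annotations are used anywhere.
corpus = ["the cat sat", "the cat ran"]
invariant = sensitivity(bag_of_words_score, reverse_words, corpus)
order_aware = sensitivity(bigram_score, reverse_words, corpus)
print(invariant, order_aware)
```

The order-blind scorer is unaffected by word reversal, while the order-aware scorer changes. In a real setting, `score_fn` would wrap a call to the model under evaluation, and `transform` would be chosen to probe the property of interest, such as grammatical structure or tokenization errors.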