With the rise of Large Language Models (LLMs) and their ubiquitous deployment in diverse domains, measuring language model behavior on realistic data is imperative. For example, a company deploying a client-facing chatbot must ensure that the model will not respond to client requests with profanity. Current evaluations approach this problem using small, domain-specific datasets with human-curated labels. These evaluation sets are often sampled from a narrow and simplified distribution, and their data sources can leak into the training set unnoticed, which can lead to misleading evaluations. To bypass these drawbacks, we propose a framework for self-supervised evaluation of LLMs by analyzing their sensitivity or invariance to transformations on the input text. Self-supervised evaluation can directly monitor LLM behavior on datasets collected in the wild or streamed during live model deployment. We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence, in addition to sensitivity to grammatical structure and tokenization errors. When comparisons to similar human-labeled benchmarks are available, we find strong correlations between self-supervised and human-supervised evaluations. The self-supervised paradigm complements current evaluation strategies that rely on labeled data.
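To make the idea of measuring sensitivity or invariance to input transformations concrete, the following is a minimal sketch and not the method used in the paper. It assumes perplexity as the monitored quantity and uses a toy word-unigram model as a stand-in for an LLM. The abstract specifies neither, and every name here (`make_unigram_perplexity`, `reverse_word_order`, `split_long_words`, `sensitivity`) is hypothetical. No labels are needed: each text is compared only against its own transformed copy.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def make_unigram_perplexity(corpus: str) -> Callable[[str], float]:
    """Toy stand-in for an LLM: word-unigram perplexity, add-one smoothing."""
    counts = Counter(corpus.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words

    def perplexity(text: str) -> float:
        words = text.lower().split()
        if not words:
            raise ValueError("text must contain at least one word")
        # fsum returns the exactly rounded sum, so word order cannot change it.
        log_prob = math.fsum(
            math.log((counts[w] + 1) / (total + vocab)) for w in words
        )
        return math.exp(-log_prob / len(words))

    return perplexity


def reverse_word_order(text: str) -> str:
    """Assumed transformation: destroys grammar, keeps the bag of words."""
    return " ".join(reversed(text.split()))


def split_long_words(text: str) -> str:
    """Assumed transformation: mimics a tokenization error by cutting
    every word longer than three characters in half."""
    pieces = []
    for word in text.split():
        if len(word) > 3:
            mid = len(word) // 2
            pieces.extend([word[:mid], word[mid:]])
        else:
            pieces.append(word)
    return " ".join(pieces)


def sensitivity(
    metric: Callable[[str], float],
    transform: Callable[[str], str],
    texts: Sequence[str],
) -> float:
    """Mean absolute relative change of `metric` when `transform` is applied."""
    if not texts:
        raise ValueError("texts must be non-empty")
    changes = []
    for text in texts:
        base = metric(text)  # perplexity is strictly positive
        changes.append(abs(metric(transform(text)) - base) / base)
    return sum(changes) / len(changes)


if __name__ == "__main__":
    ppl = make_unigram_perplexity(
        "the model answers the client politely and the client thanks the model"
    )
    texts = ["the model answers the client", "the client thanks the model"]
    print("word order:", sensitivity(ppl, reverse_word_order, texts))
    print("word splits:", sensitivity(ppl, split_long_words, texts))
```

The toy model is invariant to word order by construction, so the first score is zero, while splitting words produces unseen tokens and raises perplexity. With a real LLM, the same loop would call the model's own scoring function in place of `ppl`, and whether a high or low score is desirable depends on the transformation being studied.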