There is growing interest in evaluating language models for a variety of risks and characteristics. Evaluations that rely on natural language understanding for grading can often be performed at scale by using other language models as graders. We test the robustness of these model-graded evaluations to injections on several datasets, including a new Deception Eval. These injections resemble direct communication between the testee and the evaluator, intended to change the evaluator's grading. We extrapolate that future, more intelligent models might manipulate or cooperate with their evaluation model. We find significant susceptibility to these injections in state-of-the-art commercial models on all examined evaluations. Furthermore, similar injections can be used on automated interpretability frameworks to produce misleading model-written explanations. These results motivate future work and caution against unqualified trust in evaluations and automated interpretability.