Exploring the Robustness of Model-Graded Evaluations and Automated Interpretability

There has been increasing interest in evaluations of language models for a variety of risks and characteristics. Evaluations relying on natural language understanding for grading can often be performed at scale by using other language models. We test the robustness of these model-graded evaluations to injections on different datasets including a new Deception Eval. These injections resemble direct communication between the testee and the evaluator to change their grading. We extrapolate that future, more intelligent models might manipulate or cooperate with their evaluation model. We find significant susceptibility to these injections in state-of-the-art commercial models on all examined evaluations. Furthermore, similar injections can be used on automated interpretability frameworks to produce misleading model-written explanations. The results inspire future work and should caution against unqualified trust in evaluations and automated interpretability.

翻译：针对语言模型在多种风险与特性方面的评估日益受到关注。依靠自然语言理解进行评分的评估，往往可以通过使用其他语言模型实现规模化运作。我们测试了这些模型分级评估在不同数据集（包括新构建的欺骗性评估数据集）中对注入攻击的鲁棒性。这种注入攻击模拟了被测模型与评估模型之间的直接通信，旨在改变其评分结果。我们推断，未来更智能的模型可能会操纵或勾结其评估模型。研究发现，在所有测试的评估任务中，当前最先进的商业模型对这些注入攻击表现出显著的敏感性。此外，类似的注入攻击还可被用于自动化可解释性框架，生成具有误导性的模型自述解释。这些成果为未来研究提供了启示，并应警惕对评估结果与自动化可解释性的无条件信任。

相关内容

Automator

关注 5

Automator是苹果公司为他们的Mac OS X系统开发的一款软件。 只要通过点击拖拽鼠标等操作就可以将一系列动作组合成一个工作流，从而帮助你自动的（可重复的）完成一些复杂的工作。Automator还能横跨很多不同种类的程序，包括：查找器、Safari网络浏览器、iCal、地址簿或者其他的一些程序。它还能和一些第三方的程序一起工作，如微软的Office、Adobe公司的Photoshop或者Pixelmator等。

生成性对抗网络:理论模型、评估指标和最近发展的概述，Generative Adversarial Networks (GANs): An Overview of Theoretical Model, Evaluation Metrics, and Recent Developments

专知会员服务

42+阅读 · 2020年5月30日

【CHI2020-微软】解释可解释性:理解数据科学家使用机器学习的可解释性工具，Interpreting Interpretability: Understanding Data Scientists’Use of Interpretability Tools for Machine Learning

专知会员服务

55+阅读 · 2020年3月8日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日