Adversarial benchmarks validate model abilities by providing samples that fool models but not humans. However, despite the proliferation of datasets that claim to be adversarial, there is no established metric for evaluating how adversarial these datasets actually are. To address this lacuna, we introduce ADVSCORE, a metric that quantifies how adversarial and discriminative an adversarial dataset is and exposes the features that make data adversarial. We then use ADVSCORE to underpin a dataset creation pipeline that incentivizes writing high-quality adversarial datasets. As a proof of concept, we use ADVSCORE to collect an adversarial question answering (QA) dataset, ADVQA, from our pipeline. The high-quality questions in ADVQA surpass three adversarial benchmarks across domains at fooling several models but not humans. We validate this result using difficulty estimates derived from 9,347 human responses on four datasets and predictions from three models. Moreover, ADVSCORE uncovers which adversarial tactics used by human writers fool models (e.g., GPT-4) but not humans. Through ADVSCORE and its analyses, we offer guidance on revealing language model vulnerabilities and producing reliable adversarial examples.