As AI systems see widespread public use across domains, adversarial robustness has become increasingly vital for maintaining safety and preventing undesirable errors. Researchers have curated various adversarial datasets, built by perturbing existing examples, to capture model deficiencies that standard benchmark datasets do not reveal. However, little is known about how these adversarial examples differ from the original data points, and there is still no methodology for measuring the intended and unintended consequences of such adversarial transformations. In this research, we conduct a systematic survey of existing quantifiable metrics that describe text instances in NLP tasks along three dimensions: difficulty, diversity, and disagreement. We then select several current adversarial effect datasets and compare the metric distributions of the original examples with those of their adversarial counterparts. The results offer insight into what makes these datasets more challenging from a metrics perspective and whether they align with their underlying assumptions.