Many adversarial attacks in NLP perturb inputs to produce visually similar strings ('ergo' $\rightarrow$ '$\epsilon$rgo') that remain legible to humans but degrade model performance. Although preserving legibility is a necessary condition for text perturbation, little work has been done to systematically characterize it; instead, legibility is typically enforced only loosely, via intuitions about the nature and extent of perturbations. In particular, it is unclear to what extent inputs can be perturbed while preserving legibility, or how to quantify the legibility of a perturbed string. In this work, we address this gap by learning models that predict the legibility of a perturbed string and rank candidate perturbations by their legibility. To do so, we collect and release LEGIT, a human-annotated dataset recording the legibility of visually perturbed text. Using this dataset, we build both text- and vision-based models that achieve an F1 score of up to $0.91$ in predicting whether an input is legible, and an accuracy of $0.86$ in predicting which of two given perturbations is more legible. Additionally, we find that legible perturbations from the LEGIT dataset are more effective at lowering the performance of NLP models than the best-known attack strategies, suggesting that current models may be vulnerable to a broad range of perturbations beyond what is captured by existing visual attacks. Data, code, and models are available at https://github.com/dvsth/learning-legibility-2023.