Large Language Models (LLMs) have demonstrated impressive reasoning capabilities with Chain-of-Thought (CoT) prompting. However, CoT can be biased by users' instructions. In this work, we study the reasoning robustness of LLMs to typographical errors, which can occur naturally in users' queries. We design an Adversarial Typo Attack ($\texttt{ATA}$) algorithm that iteratively samples typos for words that are important to the query and selects the edit most likely to succeed in attacking. $\texttt{ATA}$ shows that LLMs are sensitive to minimal adversarial typographical changes. Notably, with a single character edit, Mistral-7B-Instruct's accuracy on GSM8K drops from 43.7% to 38.6%, and with 8 character edits the performance drops further to 19.2%. To extend our evaluation to larger and closed-source LLMs, we develop the $\texttt{R$^2$ATA}$ benchmark, which assesses models' $\underline{R}$easoning $\underline{R}$obustness to $\underline{\texttt{ATA}}$. It includes adversarial typographical questions derived from three widely used reasoning datasets (GSM8K, BBH, and MMLU) by applying $\texttt{ATA}$ to open-source LLMs. $\texttt{R$^2$ATA}$ exhibits remarkable transferability, causing notable performance drops across multiple very large and closed-source LLMs.
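To make the iterative sample-and-select procedure concrete, below is a minimal Python sketch of a greedy attack loop in the spirit of $\texttt{ATA}$. The callables `importance` (how much a word matters to the model's answer) and `attack_score` (how likely a perturbed question is to flip the answer) are hypothetical stand-ins for the paper's actual scoring functions, and the character-level edit operations are illustrative assumptions, not the exact edit set used in the work.

```python
import random
import string

def candidate_typos(word, n=5):
    """Sample up to n single-character edits (substitute / delete / swap) of a word."""
    edits = set()
    while len(edits) < n and len(word) > 1:
        i = random.randrange(len(word))
        op = random.choice(["sub", "del", "swap"])
        if op == "sub":
            edits.add(word[:i] + random.choice(string.ascii_lowercase) + word[i + 1:])
        elif op == "del":
            edits.add(word[:i] + word[i + 1:])
        elif op == "swap" and i < len(word) - 1:
            edits.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    return list(edits)

def ata_sketch(question, importance, attack_score, budget=8, top_k=3):
    """Greedy loop: at each step, perturb one of the most important words with the
    candidate typo that maximizes the (assumed) attack-success score."""
    words = question.split()
    for _ in range(budget):
        # Rank word positions by their importance to the model's answer.
        order = sorted(range(len(words)), key=lambda i: -importance(words, i))
        best = None
        for i in order[:top_k]:  # only attack the top-k most important words
            for typo in candidate_typos(words[i]):
                trial = words[:i] + [typo] + words[i + 1:]
                score = attack_score(" ".join(trial))  # higher = more likely to flip the answer
                if best is None or score > best[0]:
                    best = (score, trial)
        if best is not None:
            words = best[1]
    return " ".join(words)
```

With `budget=1` this sketch corresponds to the single-character-edit setting reported above, and `budget=8` to the stronger 8-edit setting; in practice the scoring functions would query the victim model.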