Advances in large language models (LLMs) have significantly improved natural language processing. However, jailbreaks (prompt injections that cause an LLM to follow instructions contrary to its intended use), hallucinations (generation of incorrect or misleading information), and comprehension errors remain prevalent. In this report, we present a comparative analysis of fifteen distinct models, each given a standardized test of 38 queries and scored on three metrics: the total number of jailbreaks, hallucinations, and comprehension errors it produced. Our work exposes these models' inherent vulnerabilities and challenges the notion that they have achieved human-level language comprehension. We empirically analysed the impact of non-standard Unicode characters on LLMs and their safeguarding mechanisms, testing the best-performing models, including GPT-4, Gemini 1.5 Pro, Llama-3-70B, and Claude 3 Opus. When we incorporated alphanumeric symbols from Unicode outside the standard Latin block, as well as variants of characters from other languages, we observed a reduction in the efficacy of guardrails implemented through Reinforcement Learning from Human Feedback (RLHF). Consequently, these models are more vulnerable to content policy breaches and prompt leakage. Our study also suggests a need to incorporate non-standard Unicode text in LLM training data to enhance the capabilities of these models.
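The abstract does not say which code points were used, so the following is only an illustrative sketch of what "alphanumeric symbols from Unicode outside the standard Latin block" can look like. It assumes the Mathematical Bold range of the Mathematical Alphanumeric Symbols block as one example. The function names `to_math_bold` and `count_outside_basic_latin` are hypothetical and are not taken from the report.

```python
import unicodedata

# Assumed example range: Mathematical Bold letters and digits
# (part of the Mathematical Alphanumeric Symbols block, U+1D400-U+1D7FF).
BOLD_UPPER_A = 0x1D400  # MATHEMATICAL BOLD CAPITAL A
BOLD_LOWER_A = 0x1D41A  # MATHEMATICAL BOLD SMALL A
BOLD_DIGIT_0 = 0x1D7CE  # MATHEMATICAL BOLD DIGIT ZERO


def to_math_bold(text: str) -> str:
    """Replace ASCII letters and digits with Mathematical Bold look-alikes."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(BOLD_UPPER_A + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(BOLD_LOWER_A + ord(ch) - ord("a")))
        elif "0" <= ch <= "9":
            out.append(chr(BOLD_DIGIT_0 + ord(ch) - ord("0")))
        else:
            out.append(ch)  # spaces and punctuation are left unchanged
    return "".join(out)


def count_outside_basic_latin(text: str) -> int:
    """Count characters that fall outside the Basic Latin (ASCII) block."""
    return sum(1 for ch in text if ord(ch) > 0x7F)


if __name__ == "__main__":
    plain = "Hello 42"
    styled = to_math_bold(plain)
    print(styled == plain)                     # visually similar, different code points
    print(count_outside_basic_latin(styled))   # letters and digits moved out of ASCII
    # NFKC compatibility normalization folds these symbols back to ASCII.
    print(unicodedata.normalize("NFKC", styled) == plain)
```

The styled string reads the same to a person but shares no letter or digit code points with the original, which is the kind of input the report describes. The NFKC check is an addition for illustration, not a step described in the report: it shows that this particular range can be folded back to ASCII by standard compatibility normalization.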