Large Language Models (LLMs) undergo continuous updates to improve user experience. However, prior research on the security and safety implications of LLMs has primarily focused on specific model versions, overlooking the impact of successive LLM updates. This gap calls for a holistic understanding of the risks across different versions of LLMs. To fill it, in this paper we conduct a longitudinal study of adversarial robustness -- specifically misclassification, jailbreak, and hallucination -- across three prominent LLM families: GPT, Llama, and Qwen. Our study reveals that LLM updates do not consistently improve adversarial robustness as expected. For instance, a later version of GPT-3.5 degrades in robustness to misclassification and hallucination despite its improved resilience against jailbreaks. GPT-4 and GPT-4o demonstrate (incrementally) higher robustness overall. Larger Llama and Qwen models do not uniformly exhibit improved robustness across the three aspects studied; more generally, larger model sizes do not necessarily yield improved robustness. Minor updates that lack substantial robustness improvements can exacerbate existing issues rather than resolve them. We hope our study offers valuable insights for navigating model updates and making informed decisions in model development and usage.