With the development of large language models (LLMs), the sequence lengths these models support continue to increase, drawing significant attention to long-context language models. However, evaluation of these models has focused primarily on their capabilities, with little research on their safety. Existing work, such as ManyShotJailbreak, has demonstrated to some extent that long-context language models can exhibit safety concerns, but the methods employed are limited and lack comprehensiveness. In response, we introduce \textbf{LongSafetyBench}, the first benchmark designed to objectively and comprehensively evaluate the safety of long-context models. LongSafetyBench consists of 10 task categories, with an average length of 41,889 words. After testing eight long-context language models on LongSafetyBench, we find that existing models generally exhibit insufficient safety capabilities: the proportion of safe responses from most mainstream long-context LLMs falls below 50\%. Moreover, a model's safety performance in long-context scenarios does not always align with its performance in short-context scenarios. Further investigation reveals that long-context models tend to overlook harmful content embedded in lengthy texts. We also propose a simple yet effective solution that allows open-source models to achieve performance comparable to that of top-tier closed-source models. We believe that LongSafetyBench can serve as a valuable benchmark for evaluating the safety capabilities of long-context language models, and we hope that our work will encourage the broader community to attend to the safety of long-context models and contribute to developing solutions that improve the safety of long-context LLMs.