The capabilities of recent large language models (LLMs) to generate high-quality content that humans cannot distinguish from human-written texts raise many concerns regarding their misuse. Previous research has shown that LLMs can be effectively misused to generate disinformation news articles following predefined narratives. Their capabilities to generate content personalized in various aspects have also been evaluated and mostly found usable. However, the combination of LLMs' personalization and disinformation capabilities has not yet been comprehensively studied. Such a dangerous combination should trigger the integrated safety filters of the LLMs, if any are present. This study fills this gap by evaluating the vulnerabilities of recent open and closed LLMs, along with their willingness to generate personalized disinformation news articles in English. We further explore whether the LLMs can reliably meta-evaluate personalization quality and whether personalization affects the detectability of the generated texts. Our results demonstrate the need for stronger safety filters and disclaimers, as these do not function properly in most of the evaluated LLMs. Additionally, our study reveals that personalization actually reduces safety-filter activations, thus effectively functioning as a jailbreak. Such behavior must be urgently addressed by LLM developers and service providers.