Understanding LLM Behavior When Encountering User-Supplied Harmful Content in Harmless Tasks

Large Language Models (LLMs) are increasingly trained to align with human values, primarily focusing on task level, i.e., refusing to execute directly harmful tasks. However, a subtle yet crucial content-level ethical question is often overlooked: when performing a seemingly benign task, will LLMs -- like morally conscious human beings -- refuse to proceed when encountering harmful content in user-provided material? In this study, we aim to understand this content-level ethical question and systematically evaluate its implications for mainstream LLMs. We first construct a harmful knowledge dataset (i.e., non-compliant with OpenAI's usage policy) to serve as the user-supplied harmful content, with 1,357 entries across ten harmful categories. We then design nine harmless tasks (i.e., compliant with OpenAI's usage policy) to simulate the real-world benign tasks, grouped into three categories according to the extent of user-supplied content required: extensive, moderate, and limited. Leveraging the harmful knowledge dataset and the set of harmless tasks, we evaluate how nine LLMs behave when exposed to user-supplied harmful content during the execution of benign tasks, and further examine how the dynamics between harmful knowledge categories and tasks affect different LLMs. Our results show that current LLMs, even the latest GPT-5.2 and Gemini-3-Pro, often fail to uphold human-aligned ethics by continuing to process harmful content in harmless tasks. Furthermore, external knowledge from the ``Violence/Graphic'' category and the ``Translation'' task is more likely to elicit harmful responses from LLMs. We also conduct extensive ablation studies to investigate potential factors affecting this novel misuse vulnerability. We hope that our study could inspire enhanced safety measures among stakeholders to mitigate this overlooked content-level ethical risk.

翻译：大语言模型（LLMs）正日益被训练以与人类价值观对齐，主要关注任务层面，即拒绝执行直接有害的任务。然而，一个微妙但至关重要的内容层面伦理问题常被忽视：在执行看似良性的任务时，LLMs是否会像具有道德意识的人类一样，在遇到用户提供材料中的有害内容时拒绝继续？本研究旨在理解这一内容层面的伦理问题，并系统评估其对主流LLMs的影响。我们首先构建了一个有害知识数据集（即不符合OpenAI使用政策）作为用户提供的有害内容，包含十个有害类别的1,357个条目。随后，我们设计了九个无害任务（即符合OpenAI使用政策）以模拟现实世界中的良性任务，根据所需用户提供内容的程度分为三类：大量、适度和有限。利用该有害知识数据集和一组无害任务，我们评估了九个LLMs在执行良性任务时暴露于用户提供有害内容时的行为，并进一步研究了有害知识类别与任务之间的动态关系如何影响不同的LLMs。我们的结果表明，当前的LLMs，即使是最新的GPT-5.2和Gemini-3-Pro，也经常无法坚持与人类对齐的伦理准则，在无害任务中继续处理有害内容。此外，来自“暴力/图像”类别的外部知识和“翻译”任务更可能引发LLMs的有害响应。我们还进行了广泛的消融研究，以探讨影响这种新型滥用漏洞的潜在因素。我们希望我们的研究能够激发利益相关者加强安全措施，以减轻这一被忽视的内容层面伦理风险。