Large language models (LLMs) are increasingly used for creative writing and engagement content, raising safety concerns about their outputs. Casting humor generation as a testbed, this work evaluates how funniness optimization in modern LLM pipelines couples with harmful content by jointly measuring humor, stereotypicality, and toxicity, supplemented by an information-theoretic analysis of incongruity signals. Across six models, we observe that harmful outputs receive higher humor scores, an effect that further increases under role-based prompting, indicating a bias-amplification loop between generators and evaluators. Information-theoretic analyses show that harmful cues widen predictive uncertainty and, surprisingly, can even make harmful punchlines more expected for some models, suggesting that such content is structurally embedded in learned humor distributions. External validation on an additional satire-generation task with human funniness judgments shows that LLM satire increases stereotypicality and, typically, toxicity, including for closed models. Quantitatively, stereotypical/toxic jokes gain $10$--$21\%$ in mean humor score; stereotypical jokes appear $11\%$ to $28\%$ more often among jokes marked funny by an LLM-based metric, and up to $10\%$ more often among generations perceived as funny by humans.