Large language models (LLMs) are increasingly used for creative writing and engagement content, raising safety concerns about their outputs. Casting humor generation as a testbed, this work evaluates how funniness optimization in modern LLM pipelines couples with harmful content by jointly measuring humor, stereotypicality, and toxicity, supplemented by an information-theoretic analysis of incongruity signals. Across six models, we observe that harmful outputs receive higher humor scores, an effect that further increases under role-based prompting, indicating a bias-amplification loop between generators and evaluators. Information-theoretic analyses show that harmful cues widen predictive uncertainty and, surprisingly, can even make harmful punchlines more expected for some models, suggesting that such content is structurally embedded in learned humor distributions. External validation on an additional satire-generation task with human funniness judgments shows that LLM satire increases stereotypicality and, typically, toxicity, including for closed models. Quantitatively, stereotypical/toxic jokes gain $10$--$21\%$ in mean humor score; stereotypical jokes appear $11\%$ to $28\%$ more often among jokes marked funny by an LLM-based metric, and up to $10\%$ more often among generations perceived as funny by humans.