Large language models (LLMs) are increasingly used for automated news credibility assessment, yet it remains unclear whether they apply even-handed standards across journalistic genres. We examine whether zero-shot LLMs are more likely to misclassify legitimate entertainment news as fake than legitimate hard news, using a within-dataset design on GossipCop from FakeNewsNet. Across four frontier models, we find a clear but model-specific genre asymmetry: DeepSeek-V3.2 and GPT-5.2 show false-positive-rate gaps of 10.1 and 8.8 percentage points, respectively (both $p < .001$), whereas Claude Opus 4.6 and Gemini 3 Flash show no comparable difference. A style-swap experiment yields only limited and inconsistent changes, suggesting that the asymmetry is not reducible to stylistic register alone. Prompt-based mitigation is likewise possible but not generic: framing the model as an entertainment-news fact-checker reduces false positives for DeepSeek-V3.2 by about 50\% without detectable recall loss, but offers little improvement for GPT-5.2. Exploratory qualitative coding further suggests two recurring error patterns in sampled false positives: treating private-life claims as inherently unverifiable and discounting entertainment journalism as an epistemically weaker genre. Taken together, these findings show that aggregate performance metrics can obscure structured false positives within legitimate journalism. We argue that LLM-based credibility assessment may not only evaluate truth claims but also differentially recognize the legitimacy of journalistic genres, and that evaluation should therefore include genre-stratified false-positive analysis alongside overall accuracy.
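The genre-stratified false-positive analysis recommended above can be illustrated with a minimal sketch. It assumes each item is recorded as a (genre, true label, predicted label) triple and compares the two false-positive rates with a pooled two-proportion z-test. The abstract does not state which significance test was used, so the test choice, the function names, and the toy counts below are all illustrative assumptions rather than the study's method or data.

```python
import math


def stratified_fpr(records):
    """Return {genre: (false_positives, n_legitimate, fpr)}.

    `records` is an iterable of (genre, true_label, predicted_label) with
    labels "real" or "fake" (a hypothetical schema chosen for this sketch).
    """
    counts = {}
    for genre, truth, pred in records:
        if truth != "real":
            continue  # FPR is defined over legitimate items only
        fp, n = counts.get(genre, (0, 0))
        counts[genre] = (fp + (pred == "fake"), n + 1)
    return {g: (fp, n, fp / n) for g, (fp, n) in counts.items()}


def two_proportion_z_test(fp_a, n_a, fp_b, n_b):
    """Pooled two-sided z-test for a difference between two proportions."""
    pooled = (fp_a + fp_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (fp_a / n_a - fp_b / n_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Synthetic toy counts, NOT results from the study.
records = (
    [("entertainment", "real", "fake")] * 40
    + [("entertainment", "real", "real")] * 160
    + [("hard", "real", "fake")] * 20
    + [("hard", "real", "real")] * 180
    + [("entertainment", "fake", "fake")] * 50  # ignored: not legitimate
)

stats = stratified_fpr(records)
fp_e, n_e, fpr_e = stats["entertainment"]
fp_h, n_h, fpr_h = stats["hard"]
gap = fpr_e - fpr_h
z, p = two_proportion_z_test(fp_e, n_e, fp_h, n_h)
print(f"FPR gap: {100 * gap:.1f} percentage points, z = {z:.2f}, p = {p:.4f}")
```

Reporting the per-genre rates alongside overall accuracy makes an asymmetry of this kind visible even when the aggregate metric looks acceptable.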