The proliferation of fake news across diverse domains highlights critical limitations in current detection systems, which are often tied to a single domain and generalize poorly. Existing cross-domain approaches face two key challenges: (1) reliance on labelled data, which is frequently unavailable and resource-intensive to acquire, and (2) information loss caused by rigid domain categorization or neglect of domain-specific features. To address these issues, we propose CoALFake, a novel approach to cross-domain fake news detection that integrates human-Large Language Model (LLM) co-annotation with domain-aware Active Learning (AL). Our method employs LLMs for scalable, low-cost annotation while maintaining human oversight to ensure label reliability. By integrating domain embedding techniques, CoALFake dynamically captures both domain-specific nuances and cross-domain patterns, enabling the training of a domain-agnostic model. Furthermore, a domain-aware sampling strategy optimizes sample acquisition by prioritizing diverse domain coverage. Experiments on multiple datasets show that CoALFake consistently outperforms a range of existing baselines, even with minimal human oversight. These results indicate that human-LLM co-annotation is a cost-effective way to obtain labels without sacrificing detection performance.
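As a rough illustration of the two ideas named above, the following Python sketch shows one plausible reading of domain-aware sampling (spread the labelling budget across domains, taking the most uncertain items first) and of human-LLM co-annotation (accept the LLM label when it is confident, otherwise defer to a human). The abstract does not specify the actual procedure, so the function names `domain_aware_sample` and `route_annotation`, the round-robin scheme, the uncertainty scores, and the `threshold` value are all assumptions made for illustration, not CoALFake's implementation.

```python
from collections import defaultdict


def domain_aware_sample(pool, budget):
    """Pick up to `budget` item ids, cycling over domains (hypothetical scheme).

    pool: list of (item_id, domain, uncertainty) tuples.
    Within each domain, more uncertain items are taken first.
    """
    by_domain = defaultdict(list)
    for item_id, domain, uncertainty in pool:
        by_domain[domain].append((uncertainty, item_id))
    # Most uncertain first within each domain.
    for items in by_domain.values():
        items.sort(reverse=True)

    selected = []
    domains = sorted(by_domain)  # fixed order keeps the result deterministic
    rank = 0
    while len(selected) < budget:
        progressed = False
        for domain in domains:
            items = by_domain[domain]
            if rank < len(items) and len(selected) < budget:
                selected.append(items[rank][1])
                progressed = True
        if not progressed:  # every domain is exhausted
            break
        rank += 1
    return selected


def route_annotation(llm_label, llm_confidence, threshold=0.8):
    """Keep the LLM label if confident enough, else defer to a human (assumed rule)."""
    if llm_confidence >= threshold:
        return ("llm", llm_label)
    return ("human", None)


# Toy pool: (item_id, domain, uncertainty); values are made up.
pool = [
    ("a", "politics", 0.9),
    ("b", "politics", 0.8),
    ("c", "health", 0.4),
    ("d", "health", 0.7),
    ("e", "sports", 0.1),
]
print(domain_aware_sample(pool, 4))  # ['d', 'a', 'e', 'c']
print(route_annotation("fake", 0.5))  # ('human', None)
```

In this toy run the low-uncertainty sports item is selected ahead of the second politics item, which is the trade-off a coverage-first strategy makes: it gives up some per-item informativeness so that no domain goes unrepresented in the labelled set.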