Recent advances in natural language processing (NLP) have led to a new trend of applying large language models (LLMs) to real-world scenarios. While the latest LLMs are astonishingly fluent when interacting with humans, they suffer from the misinformation problem: they unintentionally generate factually false statements. This can lead to harmful consequences, especially in sensitive contexts such as healthcare. Yet few previous works have focused on evaluating misinformation in the long-form (LF) generation of LLMs, especially for knowledge-intensive topics. Moreover, although LLMs have been shown to perform well in different languages, misinformation evaluation has mostly been conducted in English. To this end, we present a benchmark, CARE-MI, for evaluating LLM misinformation in: 1) a sensitive topic, specifically the maternity and infant care domain; and 2) a language other than English, namely Chinese. Most importantly, we provide an innovative paradigm for building LF generation evaluation benchmarks that can be transferred to other knowledge-intensive domains and low-resource languages. Our proposed benchmark fills the gap between the extensive usage of LLMs and the lack of datasets for assessing the misinformation generated by these models. It contains 1,612 expert-checked questions, accompanied by human-selected references. Using our benchmark, we conduct extensive experiments and find that current Chinese LLMs are far from perfect in the topic of maternity and infant care. To minimize the reliance on human resources for performance evaluation, we offer off-the-shelf judgment models for automatically assessing the LF output of LLMs given benchmark questions. Moreover, we compare potential solutions for LF generation evaluation and provide insights for building better automated metrics.