Recent advances in natural language processing (NLP) have led to a new trend of applying large language models (LLMs) to real-world scenarios. While the latest LLMs are astonishingly fluent when interacting with humans, they suffer from the misinformation problem: they unintentionally generate factually false statements. This can lead to harmful consequences, especially in sensitive contexts such as healthcare. Yet few previous works have focused on evaluating misinformation in the long-form (LF) generation of LLMs, especially for knowledge-intensive topics. Moreover, although LLMs have been shown to perform well in different languages, misinformation evaluation has mostly been conducted in English. To this end, we present a benchmark, CARE-MI, for evaluating LLM misinformation in: 1) a sensitive topic, specifically the maternity and infant care domain; and 2) a language other than English, namely Chinese. Most importantly, we provide an innovative paradigm for building LF generation evaluation benchmarks that can be transferred to other knowledge-intensive domains and low-resource languages. Our proposed benchmark fills the gap between the extensive usage of LLMs and the lack of datasets for assessing the misinformation generated by these models. It contains 1,612 expert-checked questions, accompanied by human-selected references. Using our benchmark, we conduct extensive experiments and find that current Chinese LLMs are far from perfect in the topic of maternity and infant care. In an effort to minimize the reliance on human resources for performance evaluation, we offer off-the-shelf judgment models for automatically assessing the LF output of LLMs given benchmark questions. Moreover, we compare potential solutions for LF generation evaluation and provide insights for building better automated metrics.