Recent advances in natural language processing (NLP) have led to a new trend of applying large language models (LLMs) to real-world scenarios. While the latest LLMs are astonishingly fluent when interacting with humans, they suffer from the misinformation problem: they unintentionally generate factually false statements. This can lead to harmful consequences, especially in sensitive contexts such as healthcare. Yet few previous works have focused on evaluating misinformation in the long-form (LF) generation of LLMs, especially for knowledge-intensive topics. Moreover, although LLMs have been shown to perform well in different languages, misinformation evaluation has mostly been conducted in English. To this end, we present a benchmark, CARE-MI, for evaluating LLM misinformation in: 1) a sensitive topic, specifically the maternity and infant care domain; and 2) a language other than English, namely Chinese. Most importantly, we provide an innovative paradigm for building LF generation evaluation benchmarks that can be transferred to other knowledge-intensive domains and low-resource languages. Our proposed benchmark fills the gap between the extensive usage of LLMs and the lack of datasets for assessing the misinformation generated by these models. It contains 1,612 expert-checked questions, accompanied by human-selected references. Using our benchmark, we conduct extensive experiments and find that current Chinese LLMs are far from perfect on the topic of maternity and infant care. To minimize the reliance on human resources for performance evaluation, we offer off-the-shelf judgment models for automatically assessing the LF output of LLMs given benchmark questions. Moreover, we compare potential solutions for LF generation evaluation and provide insights for building better automated metrics.