Recent advances in NLP have led to a new trend of applying LLMs to real-world scenarios. While the latest LLMs are astonishingly fluent when interacting with humans, they suffer from the misinformation problem: they unintentionally generate factually false statements. This can lead to harmful consequences, especially in sensitive contexts such as healthcare. Yet few previous works have focused on evaluating misinformation in the long-form generation of LLMs, especially for knowledge-intensive topics. Moreover, although LLMs have been shown to perform well in different languages, misinformation evaluation has mostly been conducted in English. To this end, we present a benchmark, CARE-MI, for evaluating LLM misinformation in: 1) a sensitive topic, specifically the maternity and infant care domain; and 2) a language other than English, namely Chinese. Most importantly, we provide an innovative paradigm for building long-form generation evaluation benchmarks that can be transferred to other knowledge-intensive domains and low-resource languages. Our proposed benchmark fills the gap between the extensive usage of LLMs and the lack of datasets for assessing the misinformation generated by these models. It contains 1,612 expert-checked questions, accompanied by human-selected references. Using our benchmark, we conduct extensive experiments and find that current Chinese LLMs are far from perfect on the topic of maternity and infant care. To minimize the reliance on human resources for performance evaluation, we offer a judgment model for automatically assessing the long-form output of LLMs using the benchmark questions. Moreover, we compare potential solutions for long-form generation evaluation and provide insights for building more robust and efficient automated metrics.