The Earth is Flat? Unveiling Factual Errors in Large Language Models

Large Language Models (LLMs) like ChatGPT are foundational in various applications due to the extensive knowledge they acquire through pre-training and fine-tuning. Despite this, they are prone to generating factual and commonsense errors, raising concerns that they may mislead users in critical areas such as healthcare, journalism, and education. Current methods for evaluating LLMs' veracity are limited by test data leakage or the need for extensive human labor, hindering efficient and accurate error detection. To tackle this problem, we introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs. This framework involves three main steps: First, it constructs a factual knowledge graph by retrieving fact triplets from a large-scale knowledge database. Then, leveraging the knowledge graph, FactChecker employs a rule-based approach to generate three types of questions (Yes-No, Multiple-Choice, and WH questions) that involve single-hop and multi-hop relations, along with correct answers. Lastly, it assesses the LLMs' responses for accuracy using tailored matching strategies for each question type. Our extensive tests on six prominent LLMs, including text-davinci-002, text-davinci-003, ChatGPT (gpt-3.5-turbo, gpt-4), Vicuna, and LLaMA-2, reveal that FactChecker can trigger factual errors in up to 45% of questions in these models. Moreover, we demonstrate that FactChecker's test cases can improve LLMs' factual accuracy through in-context learning and fine-tuning (e.g., llama-2-13b-chat's accuracy increases from 35.3% to 68.5%). We are making all code, data, and results available for future research endeavors.
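To make the three steps in the abstract more concrete, the following is a minimal sketch of how a single-hop fact triplet might be turned into the three question types and how a response might be matched against the expected answer. It is an illustration under stated assumptions, not FactChecker's actual implementation: the `Triplet` class, the `RULES` table, the question phrasings, and the matching logic in `is_correct` are all hypothetical, and the sketch skips knowledge-graph retrieval and multi-hop relations entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """A fact of the form (subject, relation, object); hypothetical structure."""
    subject: str
    relation: str
    obj: str


# Hypothetical rule table: one phrasing per relation (an assumption for this sketch).
RULES = {"capital_of": "the capital of {obj}"}


def yes_no_question(t: Triplet) -> tuple[str, str]:
    """Build a Yes-No question whose expected answer is 'yes'."""
    phrase = RULES[t.relation].format(obj=t.obj)
    return f"Is {t.subject} {phrase}?", "yes"


def wh_question(t: Triplet) -> tuple[str, str]:
    """Build a WH question whose expected answer is the triplet's subject."""
    phrase = RULES[t.relation].format(obj=t.obj)
    return f"What is {phrase}?", t.subject


def multiple_choice_question(t: Triplet, distractors: list[str]) -> tuple[str, str]:
    """Build a Multiple-Choice question; options are sorted so output is deterministic."""
    options = sorted([t.subject, *distractors])
    labels = "ABCDEFGH"
    listing = " ".join(f"({labels[i]}) {o}" for i, o in enumerate(options))
    phrase = RULES[t.relation].format(obj=t.obj)
    answer = labels[options.index(t.subject)]
    return f"Which of the following is {phrase}? {listing}", answer


def is_correct(kind: str, response: str, expected: str) -> bool:
    """Naive per-type matching (an assumption; a real matcher would be more robust)."""
    text = response.strip().lower()
    if kind == "yes_no":
        return text.startswith(expected.lower())
    if kind == "multiple_choice":
        # Accept responses such as "C" or "(C) Paris".
        return text.lstrip("(").startswith(expected.lower())
    # WH questions: check that the expected entity appears in the response.
    return expected.lower() in text


fact = Triplet("Paris", "capital_of", "France")
question, expected = wh_question(fact)
print(question, "->", expected)
```

In this sketch the distractors for the Multiple-Choice question are passed in by the caller; in a knowledge-graph setting they could plausibly be drawn from other entities that share the same relation, but that choice is also an assumption here.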

