Large language models (LLMs), such as ChatGPT, have rapidly permeated people's work and daily lives over the past few years, owing to their extraordinary conversational ability and intelligence. ChatGPT has become the fastest-growing software application in history in terms of user numbers and has become an important foundation model for the next generation of artificial intelligence applications. However, the content generated by LLMs is not entirely reliable, often containing factual errors, biases, and toxicity. Given their vast user base and wide range of application scenarios, such unreliable responses can cause serious negative impacts. This thesis presents exploratory work on language model reliability conducted during my PhD study, focusing on the correctness, non-toxicity, and fairness of LLMs from both software testing and natural language processing perspectives. First, to measure the correctness of LLMs, we introduce two testing frameworks, FactChecker and LogicAsker, which evaluate factual knowledge and logical reasoning accuracy, respectively. Second, to assess the non-toxicity of LLMs, we introduce two works on red-teaming LLMs. Third, to evaluate the fairness of LLMs, we introduce two evaluation frameworks, BiasAsker and XCulturalBench, which measure the social bias and cultural bias of LLMs, respectively.