Large Language Models (LLMs) have dominated the news since November 2022, when ChatGPT was introduced. More than a year later, one of the major reasons companies resist adopting them is their limited confidence in the trustworthiness of these systems. In a study by Baymard (2023), ChatGPT-4 showed an 80.1% false-positive error rate in identifying usability issues on websites. A January 2024 study published in JAMA Pediatrics found that ChatGPT had an accuracy rate of 17% when diagnosing pediatric medical cases (Barile et al., 2024). But then, what is "trust"? Trust is a relative, subjective condition that can change based on culture, domain, and individuals. And then, given a domain, how can the trustworthiness of a system be measured? In this paper, I present a systematic approach to measuring trustworthiness against a predefined ground truth, represented as a knowledge graph of the domain. The approach is a process with humans in the loop to validate the representation of the domain and to fine-tune the system. Measuring trustworthiness would be essential for entities operating in critical environments, such as healthcare, defense, and finance, but it would also be relevant to all users of LLMs.