Large language models (LLMs) have recently achieved significant success across various application domains, garnering substantial attention from different communities. Unfortunately, even the best LLMs still exhibit many \textit{faults}, that is, inputs for which the model fails to produce a correct prediction. Such faults harm the usability of LLMs in general and could introduce safety issues in reliability-critical systems such as autonomous driving systems. Quickly revealing these faults in the real-world datasets that an LLM may face is important but challenging. The main reason is that ground truth is required to identify a fault, yet data labeling is costly in both time and human effort. To address this problem, the conventional deep learning testing field has proposed test selection methods that evaluate deep learning models efficiently by prioritizing inputs likely to reveal faults. However, despite their importance, the usefulness of these methods on LLMs is unclear and remains unexplored. In this paper, we conduct the first empirical study to investigate the effectiveness of existing fault detection methods for LLMs. Experimental results on four different tasks~(including both code tasks and natural language processing tasks) and four LLMs~(e.g., LLaMA3 and GPT4) demonstrate that simple methods such as Margin perform well on LLMs, but there is still substantial room for improvement. Based on the study, we further propose \textbf{MuCS}, a prompt \textbf{Mu}tation-based prediction \textbf{C}onfidence \textbf{S}moothing framework that boosts the fault detection capability of existing methods. Concretely, we propose multiple prompt mutation techniques that help collect more diverse outputs for confidence smoothing. The results show that our proposed framework significantly enhances existing methods, improving test relative coverage by up to 70.53\%.
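To make the idea of Margin-based selection with confidence smoothing concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that each test input yields one probability vector for the original prompt and one for each mutated prompt, that smoothing is a plain element-wise average of those vectors (the abstract does not specify the aggregation), and that Margin is the gap between the two largest probabilities. The function names `margin_score`, `smooth_confidence`, and `prioritize` are hypothetical.

```python
from typing import List, Sequence


def margin_score(probs: Sequence[float]) -> float:
    """Gap between the two largest probabilities; a small gap means low confidence."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def smooth_confidence(prob_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average the probability vectors from the original and mutated prompts.

    Assumption: plain element-wise averaging stands in for the smoothing step.
    """
    n = len(prob_vectors)
    k = len(prob_vectors[0])
    return [sum(v[i] for v in prob_vectors) / n for i in range(k)]


def prioritize(test_probs: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Return test indices ordered from most to least likely to be a fault.

    test_probs[i] holds the probability vectors of test i, one per prompt variant.
    """
    scores = [margin_score(smooth_confidence(vs)) for vs in test_probs]
    # Smaller smoothed margin first: these inputs are labeled first.
    return sorted(range(len(scores)), key=lambda i: scores[i])


if __name__ == "__main__":
    example = [
        [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]],  # stable and confident across prompts
        [[0.5, 0.4, 0.1], [0.3, 0.6, 0.1]],    # prediction flips under mutation
    ]
    print(prioritize(example))
```

In this sketch, an input whose prediction flips under prompt mutation ends up with a small smoothed margin and is therefore ranked ahead of a stable, confident input when choosing what to label first.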