Large Language Models (LLMs) are increasingly deployed across diverse domains, creating a need for rigorous reliability assessment methods. Existing benchmark-based evaluations primarily offer descriptive statistics of model accuracy over datasets, providing limited insight into the probabilistic behavior of LLMs under real operational conditions. This paper introduces HIP-LLM, a Hierarchical Imprecise Probability framework for modeling and inferring LLM reliability. Building on the foundations of software reliability engineering, HIP-LLM defines LLM reliability as the probability of failure-free operation over a specified number of future tasks under a given Operational Profile (OP). HIP-LLM represents dependencies across (sub-)domains hierarchically, enabling multi-level inference from subdomain to system-level reliability. It embeds imprecise priors to capture epistemic uncertainty and incorporates OPs to reflect usage contexts, deriving posterior reliability envelopes that quantify uncertainty across priors and data. Experiments on multiple benchmark datasets demonstrate that HIP-LLM offers a more accurate and standardized reliability characterization than existing benchmark-based and state-of-the-art approaches. A publicly accessible repository of HIP-LLM is provided.
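To make the abstract's key quantities concrete, the following is a minimal sketch (not the paper's actual implementation) of the core idea: with a Beta prior on a subdomain's per-task failure probability and Binomial failure counts, the posterior probability of n failure-free future tasks has a closed form, and ranging over a *set* of priors yields a lower/upper reliability envelope. The subdomain names, prior set, failure counts, and OP weights below are hypothetical illustrations.

```python
from math import prod

def reliability_n_tasks(a: float, b: float, failures: int, tasks: int, n: int) -> float:
    """Posterior probability of n failure-free future tasks, E[(1-p)^n],
    under a Beta(a, b) prior on per-task failure probability p and
    observed Binomial data (failures out of tasks). Uses the exact
    Beta-posterior moment: prod_{j=0}^{n-1} (beta + j) / (alpha + beta + j)."""
    alpha = a + failures
    beta = b + tasks - failures
    return prod((beta + j) / (alpha + beta + j) for j in range(n))

def reliability_envelope(prior_set, failures: int, tasks: int, n: int):
    """Lower/upper posterior reliability across a set of (a, b) priors --
    the 'imprecise prior' envelope."""
    vals = [reliability_n_tasks(a, b, failures, tasks, n) for a, b in prior_set]
    return min(vals), max(vals)

# Hypothetical subdomain data (failures, tasks) and Operational Profile weights.
subdomains = {"math": (3, 200), "coding": (10, 150)}
op_weights = {"math": 0.7, "coding": 0.3}
priors = [(0.5, 0.5), (1.0, 1.0), (1.0, 10.0)]  # imprecise prior set

# OP-weighted bound on single-task system reliability: combine each
# subdomain's envelope for n = 1 using the usage-profile weights.
lo = hi = 0.0
for d, (k, m) in subdomains.items():
    r_lo, r_hi = reliability_envelope(priors, k, m, 1)
    lo += op_weights[d] * r_lo
    hi += op_weights[d] * r_hi
print(f"system single-task reliability envelope: [{lo:.4f}, {hi:.4f}]")
```

The envelope shrinks as data accumulate (the priors are washed out) and the reliability estimate falls as n grows, matching the intuition that more demanded failure-free tasks means lower assurance; the paper's hierarchical model additionally shares strength across subdomains, which this flat sketch omits.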