To ensure the reliable behavior of large language models (LLMs) and to monitor them, various evaluation metrics have been proposed in the literature. However, little research prescribes a methodology for identifying a robust threshold on these metrics, even though an incorrect choice of threshold can have serious implications when LLMs are deployed. Translating the traditional model risk management (MRM) guidelines used in regulated industries such as finance, we propose a step-by-step recipe for picking a threshold for a given LLM evaluation metric. We emphasize that such a methodology should begin by identifying the risks of the LLM application under consideration and the risk tolerance of its stakeholders. We then propose concrete, statistically rigorous procedures for determining a threshold for a given LLM evaluation metric using available ground-truth data. As a concrete demonstration of the proposed methodology, we apply it to the Faithfulness metric, as implemented in various publicly available libraries, using the publicly available HaluBench dataset. This work also lays a foundation for systematic approaches to threshold selection, not only for LLMs but for GenAI applications more broadly.
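To make the idea of a statistically rigorous, risk-tolerance-driven threshold concrete, the following is a minimal sketch of one such selection rule; it is not the paper's implementation. Everything here is an illustrative assumption: the pass convention (a response passes when its metric score is at or above the threshold), the function names, and the use of a one-sided Clopper-Pearson bound to account for the finite size of the ground-truth set.

```python
"""Hypothetical sketch: pick the smallest metric threshold whose rate of
passed-but-hallucinated responses stays below the stakeholders' risk
tolerance with high confidence, using labeled ground-truth data."""
import numpy as np
from scipy.stats import beta


def clopper_pearson_upper(k: int, n: int, alpha: float = 0.05) -> float:
    """One-sided (1 - alpha) Clopper-Pearson upper bound on a binomial rate,
    given k observed events out of n trials."""
    if n == 0 or k == n:
        return 1.0
    return float(beta.ppf(1.0 - alpha, k + 1, n - k))


def select_threshold(scores, labels, risk_tolerance=0.05, alpha=0.05):
    """Return the smallest threshold meeting the risk tolerance, or None.

    scores: per-example metric scores (e.g., Faithfulness scores in [0, 1]).
    labels: ground-truth flags (1 = faithful, 0 = hallucinated).
    risk_tolerance: cap on the fraction of passing responses that are
        actually hallucinated, as set by the stakeholders.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Candidate thresholds: the observed score values, in ascending order.
    for t in np.unique(scores):
        passed = scores >= t
        n = int(passed.sum())
        k = int((passed & (labels == 0)).sum())  # hallucinated but passed
        # Accept t only if the miss rate is acceptable even at the upper
        # confidence bound, so small evaluation sets are treated cautiously.
        if n > 0 and clopper_pearson_upper(k, n, alpha) <= risk_tolerance:
            return float(t)
    return None  # no threshold meets the tolerance on this data
```

In a HaluBench-style setting, `scores` would be the metric's per-example outputs and `labels` the human annotations; `alpha` controls how conservatively the procedure treats a small ground-truth set, and returning `None` signals that no threshold can be certified at the stated risk tolerance.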