Context. Risk analysis assesses potential risks in specific scenarios. Its principles are context-independent: the same methodology applies whether a risk concerns health or information technology security. Risk analysis requires vast knowledge of national and international regulations and standards and is time- and effort-intensive. A large language model (LLM) can summarize information far faster than a human and can be fine-tuned for specific tasks. Aim. Our empirical study investigates the effectiveness of Retrieval-Augmented Generation (RAG) and fine-tuned LLMs in risk analysis. To our knowledge, no prior study has explored their capabilities in this domain. Method. We manually curated 193 unique scenarios, yielding 1,283 representative samples, from over 50 mission-critical analyses archived by the industrial context team over the last five years. We compared the base GPT-3.5 and GPT-4 models against their RAG-assisted and fine-tuned counterparts. We employed two human experts as competitors of the models, and three further human experts to review both the models' and the former experts' analyses. The reviewers analyzed 5,000 scenario analyses in total. Results and Conclusions. Human experts achieved higher accuracy, but LLMs were quicker and more actionable. Moreover, our findings show that RAG-assisted LLMs have the lowest hallucination rates, effectively uncovering hidden risks and complementing human expertise. The choice of model therefore depends on specific needs: fine-tuned models (FTMs) for accuracy, RAG for discovering hidden risks, and base models for comprehensiveness and actionability. Experts can thus leverage LLMs as an effective complementary companion in risk analysis within a condensed timeframe, and can also save costs by averting unnecessary expenses associated with implementing unwarranted countermeasures.