Large Language Models (LLMs) can revolutionize how we deploy and operate Open Radio Access Networks (O-RAN) by enhancing network analytics, anomaly detection, and code generation, significantly increasing the efficiency and reliability of a plethora of O-RAN tasks. In this paper, we present ORAN-Bench-13K, the first comprehensive benchmark designed to evaluate the performance of LLMs within the context of O-RAN. Our benchmark consists of 13,952 meticulously curated multiple-choice questions generated from 116 O-RAN specification documents. We leverage a novel three-stage LLM framework to generate the questions, which are categorized into three distinct difficulty levels to cover a wide spectrum of O-RAN-related knowledge. We thoroughly evaluate the performance of several state-of-the-art LLMs, including Gemini, ChatGPT, and Mistral. Additionally, we propose ORANSight, a Retrieval-Augmented Generation (RAG)-based pipeline that demonstrates superior performance on ORAN-Bench-13K compared to the other tested closed-source models. Our findings indicate that current popular LLMs are not proficient in O-RAN, highlighting the need for specialized models. We observe a noticeable performance improvement with the RAG-based ORANSight pipeline, which achieves a Macro Accuracy of 0.784 and a Weighted Accuracy of 0.776, on average 21.55% and 22.59% higher than the other tested LLMs.
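To make the two reported metrics concrete, the following is a minimal sketch of how macro accuracy (the unweighted mean of per-difficulty accuracies) and weighted accuracy (overall accuracy weighted by the number of questions per difficulty) could be computed; the category names and counts used here are illustrative assumptions, not the paper's actual split or results.

```python
# Hypothetical sketch of the two accuracy metrics reported on ORAN-Bench-13K.
# Difficulty names and question counts below are placeholders for illustration.

def macro_and_weighted_accuracy(per_category):
    """per_category: dict mapping difficulty -> (num_correct, num_questions)."""
    # Per-category accuracy, then the unweighted mean over categories (macro).
    accuracies = {k: correct / total for k, (correct, total) in per_category.items()}
    macro = sum(accuracies.values()) / len(accuracies)
    # Weighted accuracy: total correct answers over total questions.
    total_questions = sum(total for _, total in per_category.values())
    weighted = sum(correct for correct, _ in per_category.values()) / total_questions
    return macro, weighted

# Illustrative usage with made-up numbers (not the paper's evaluation data):
example = {"easy": (900, 1000), "intermediate": (700, 1000), "hard": (500, 1000)}
macro, weighted = macro_and_weighted_accuracy(example)
print(f"Macro accuracy: {macro:.3f}, Weighted accuracy: {weighted:.3f}")
```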