FactoryLLM: A Safe and Open-Source AI Playground for Evaluating LLMs in Smart Factories

Fault diagnostics and recovery in smart factories are challenging because critical information is dispersed across the manuals of multiple machines that are interconnected through the manufacturing process. Large Language Models (LLMs) offer a promising approach. In this paper, we propose FactoryLLM, a safe and open-source AI playground for evaluating different LLM-based retrieval-augmented generation (RAG) models by analysing documents from multiple machines across the manufacturing process. FactoryLLM enables the user to configure the LLM and assess its performance when reasoning over multiple documents, through a dual evaluation setup using both RAGAS and NVIDIA's LLM-as-a-Judge metrics. FactoryLLM is safe because it allows users to run local or open-source LLMs without sharing sensitive industrial data, providing a controlled environment for experimentation. We demonstrate the efficacy of FactoryLLM through a case study involving an Autonomous Intelligent Vehicle and its Mobile Planner software, evaluating three LLMs on 30 maintenance queries derived from approximately 600 pages of cross-machine documentation. The results suggest that FactoryLLM is effective for cross-machine document reasoning: every model achieved a groundedness score above 0.88. The full code and documentation are publicly available so that the community can test FactoryLLM on their own manufacturing-specific scenarios.
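
As a rough illustration of what retrieval across the manuals of several machines can look like, the sketch below ranks passages from two manuals against one maintenance query and checks that the top hits come from both sources. It is not FactoryLLM's implementation: the function names, source labels, and passage texts are all hypothetical, and plain token overlap stands in for whatever retriever a real RAG pipeline would use.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, passage):
    """Count query tokens that also occur in the passage (toy relevance score)."""
    passage_tokens = set(tokenize(passage))
    return sum(1 for tok in tokenize(query) if tok in passage_tokens)


def retrieve(query, chunks, k=2):
    """Return the k highest-scoring (source, passage) pairs across all manuals."""
    ranked = sorted(chunks, key=lambda c: overlap_score(query, c[1]), reverse=True)
    return ranked[:k]


# Hypothetical chunks: the text is invented for illustration,
# not taken from any real machine manual.
chunks = [
    ("vehicle_manual", "If the vehicle stops with a path blocked error, clear the obstacle and resume."),
    ("planner_manual", "A path blocked error in the planner means the route must be recalculated."),
    ("vehicle_manual", "Charge the battery when the indicator turns red."),
]

hits = retrieve("How do I recover from a path blocked error?", chunks, k=2)
sources = {source for source, _ in hits}
print(sorted(sources))
```

In this toy setting the two relevant passages come from different manuals, which is the situation the abstract describes: an answer grounded in only one of them would miss half of the recovery procedure.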