Language models increasingly serve as advisory systems in maintenance operations. To prevent hallucination, recent systems ground these models in procedural documentation, constraining them to approved steps. In practice, however, operator queries frequently stray from the documented path, so models must recognise out-of-scope inputs mid-conversation, a dynamic that current benchmarks rarely prioritise. We introduce DiagFlowBench, a dataset of 50 industrial diagnostic flowcharts from a consumer-products manufacturer, converted into 1,676 multi-turn conversations that contrast compliant utterances with out-of-scope ones. Evaluating a panel of ten commercial and open-weight models reveals high variability in abstention rates, with models commonly selecting a real but contextually inappropriate step rather than fabricating facts. Because such advice maps to a genuine documented step, it is inherently plausible and authoritative, which exposes a challenging vulnerability for grounded systems.