Standard Operating Procedures (SOPs) are critical for enterprise operations, yet existing language models struggle with SOP understanding and cross-domain generalization. Current methods fail because joint training does not differentiate among the three reasoning capabilities that SOPs require: terminology precision, sequential ordering, and constraint reasoning. We propose FM SO.P, which addresses these challenges with two contributions. First, we introduce progressive task mixtures that build capabilities in stages across three task types with cumulative data: concept disambiguation for terminology precision, action sequence understanding for procedural correctness, and scenario-aware graph reasoning for conditional logic. Second, we propose an automatic multi-agent evaluation system in which three agents generate rubrics, construct stratified test sets, and perform rubric-based scoring, each adapting to the target domain (e.g., temporal constraints for DMV, regulatory compliance for banking). Evaluated on SOPBench across seven domains (Bank, DMV, Healthcare, Market, University, Library, Hotel), FM SO.P achieves a 48.3\% pass rate with our 32B model and 34.3\% with our open-source 7B model, the latter matching the Qwen-2.5-72B-Instruct baseline (34.4\%) with roughly 10x fewer parameters.
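One way to read "progressive task mixtures with cumulative data" is that each training stage adds a new task type while retaining the data from all earlier stages. The following is a minimal sketch under that assumption; the function name `build_progressive_mixtures`, the task keys, and the toy examples are hypothetical and are not taken from the FM SO.P implementation.

```python
from typing import Dict, List

# Task order follows the abstract; the one-task-per-stage layout is an assumption.
TASK_ORDER = ["concept_disambiguation", "action_sequence", "graph_reasoning"]


def build_progressive_mixtures(task_data: Dict[str, List[str]]) -> List[List[str]]:
    """Return one training set per stage; stage k holds the data of tasks 1..k."""
    stages: List[List[str]] = []
    cumulative: List[str] = []
    for task in TASK_ORDER:
        # Cumulative data: keep everything from earlier stages, then add the new task.
        cumulative = cumulative + task_data[task]
        stages.append(list(cumulative))
    return stages


# Toy placeholder examples, one or two per task type.
toy_data = {
    "concept_disambiguation": ["cd_1", "cd_2"],
    "action_sequence": ["as_1"],
    "graph_reasoning": ["gr_1", "gr_2"],
}
stages = build_progressive_mixtures(toy_data)
for index, examples in enumerate(stages, start=1):
    print(f"stage {index}: {len(examples)} examples")
```

In this sketch the final stage contains all three task types, so later stages can reinforce earlier capabilities rather than replace them; how the actual method weights or samples each task within a stage is not stated in the abstract.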