Evaluating Robustness of Large Language Models in Enterprise Applications: Benchmarks for Perturbation Consistency Across Formats and Languages

Enterprise LLM applications require consistently high quality and reliable performance across diverse scenarios, which demands robustness to minor input variations. Existing research shows that even small prompt changes can lead to substantial differences in output, but it has mainly focused on a narrow set of perturbations and small academic datasets, which limits its relevance to real-world applications. To address this, we present a comprehensive benchmark suite that evaluates robustness across multiple perturbation types, including general text edits (e.g., punctuation, whitespace), formatting changes (e.g., JSON, YAML), multilingual and cross-lingual inputs, and positional variations in instructions. Evaluating 11 models ranging from 4B to 120B+ parameters, we find that minor perturbations reduce performance by up to 40 percentage points on key enterprise metrics. Critically, we demonstrate that the relationship between model size and robustness is more nuanced than conventional assumptions suggest: one 8B-parameter model (Ministral 3 8B) outperforms most larger models, while another 8B-parameter model (Llama 3.1 8B) performs worst overall.
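To make the perturbation families above concrete, the following is a minimal Python sketch of three of them (a general text edit, a JSON formatting change, and a positional change of the instruction) together with a percentage-point drop metric. It is an illustration only, not the benchmark's implementation: every function name, the JSON field names, and the choice of accuracy as the metric are assumptions introduced here.

```python
import json


def strip_trailing_punctuation(prompt: str) -> str:
    """General text edit (hypothetical): drop trailing '.', '?' or '!'."""
    return prompt.rstrip().rstrip(".?!")


def double_spaces(prompt: str) -> str:
    """General text edit (hypothetical): turn every space into two spaces."""
    return prompt.replace(" ", "  ")


def as_json(instruction: str, document: str) -> str:
    """Formatting change (hypothetical field names): wrap the prompt parts in JSON."""
    return json.dumps(
        {"instruction": instruction, "document": document},
        ensure_ascii=False,
    )


def instruction_last(instruction: str, document: str) -> str:
    """Positional variation: place the instruction after the document."""
    return f"{document}\n\n{instruction}"


def accuracy_drop_pp(baseline_correct: list[bool], perturbed_correct: list[bool]) -> float:
    """Accuracy drop from baseline to perturbed prompts, in percentage points.

    Both lists hold one per-example correctness flag, in the same example order.
    """
    if not baseline_correct or len(baseline_correct) != len(perturbed_correct):
        raise ValueError("need two non-empty result lists of equal length")
    n = len(baseline_correct)
    baseline_acc = sum(baseline_correct) / n
    perturbed_acc = sum(perturbed_correct) / n
    return 100.0 * (baseline_acc - perturbed_acc)
```

In use, one would run the same model on the original prompts and on each perturbed variant, score both sets of outputs with the task's metric, and report the drop per perturbation type. A robust model keeps that drop close to zero across all types.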
