Systems incorporating large language models (LLMs) as a component are known to be sensitive (i.e., non-robust) to minor input variations that do not change the meaning of the input; such sensitivity may reduce the system's usefulness. Here, we present a framework to evaluate the robustness of systems that take COBOL code as input; our application is translation between the COBOL and Java programming languages, but the approach extends to other tasks such as code generation or explanation. Ensuring robustness of systems with COBOL as input is essential yet challenging: many business-critical applications are written in COBOL, yet these are typically proprietary legacy applications whose code is unavailable to LLMs for training. We develop a library of COBOL paragraph-level and full-program perturbation methods, and create variant-expanded versions of a benchmark dataset of examples for a specific task. The robustness of the LLM-based system is evaluated by measuring changes in the values of individual and aggregate metrics computed on the system's outputs. Finally, we present a series of dynamic table and chart visualization dashboards that assist in debugging the system's outputs, and in monitoring and understanding the root causes of the system's sensitivity to input variation. These tools can be further used to improve the system by, for instance, indicating variations that should be handled by pre-processing steps.
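The evaluation scheme described above, perturbing an input without changing its meaning and measuring the resulting change in an output metric, can be sketched as follows. This is a minimal illustration, not the paper's actual library: `rename_identifier` and `robustness_delta` are hypothetical helpers, and the perturbation shown (renaming a COBOL data item) is one assumed example of a semantics-preserving variation.

```python
import re

def rename_identifier(cobol_src: str, old: str, new: str) -> str:
    """Semantics-preserving perturbation (hypothetical example): rename a
    COBOL data item wherever it appears as a whole word. COBOL identifiers
    are case-insensitive, so the match is case-insensitive too."""
    return re.sub(rf"\b{re.escape(old)}\b", new, cobol_src, flags=re.IGNORECASE)

def robustness_delta(metric, output_original: str, output_variant: str,
                     reference: str) -> float:
    """Change in a per-example metric (e.g., exact match) between the
    system's output on the original input and on the perturbed variant.
    A non-zero delta signals sensitivity to the variation."""
    return metric(output_variant, reference) - metric(output_original, reference)

# Toy COBOL fragment and its perturbed variant.
original = """\
       WORKING-STORAGE SECTION.
       01 WS-TOTAL PIC 9(5) VALUE ZERO.
       PROCEDURE DIVISION.
           ADD 10 TO WS-TOTAL.
"""
variant = rename_identifier(original, "WS-TOTAL", "WS-SUM")
assert "WS-SUM" in variant and "WS-TOTAL" not in variant

# Toy metric: exact match against a reference translation.
exact_match = lambda hyp, ref: float(hyp == ref)
```

In a full harness, each benchmark example would be expanded into many such variants, the system would be run on all of them, and aggregate statistics over the per-example deltas would summarize robustness.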