The Stability Trap: Evaluating the Reliability of LLM-Based Instruction Adherence Auditing

The enterprise governance of Generative AI (GenAI) in regulated sectors, such as Human Resources (HR), demands scalable yet reproducible auditing mechanisms. While Large Language Model (LLM)-as-a-Judge approaches offer scalability, their reliability in evaluating adherence to different types of system instructions remains unverified. This study asks: To what extent does the instruction type of an Application Under Test (AUT) influence the stability of judge evaluations? To address this, we introduce the Scoped Instruction Decomposition Framework to classify AUT instructions into Objective and Subjective types, isolating the factors that drive judge instability. We applied this framework to two representative HR GenAI applications, evaluating the stability of four judge architectures over multiple runs. Our results reveal a ``Stability Trap'' characterized by a divergence between Verdict Stability and Reasoning Stability. While judges achieved near-perfect verdict agreement ($>99\%$) for both objective and subjective evaluations, their accompanying justification traces diverged significantly. Objective instructions requiring quantitative analysis, such as word counting, exhibited reasoning stability as low as $\approx19\%$, driven by variance in the numeric justifications. Similarly, reasoning stability for subjective instructions varied widely ($35\%$--$83\%$) depending on evidence granularity, with feature-specific checks failing to reproduce a consistent rationale. Conversely, objective instructions focusing on discrete entity extraction achieved high reasoning stability ($>90\%$). These findings demonstrate that high verdict stability can mask fragile reasoning. Thus, we suggest that auditors scope automated evaluation protocols strictly: delegate all deterministically verifiable logic to code, while reserving LLM judges for complex semantic evaluation.
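The gap between verdict agreement and rationale agreement, and the recommendation to move countable checks into code, can be illustrated with a minimal Python sketch. The abstract does not define how the two stability metrics are computed, so the sketch assumes a simple stand-in: the fraction of runs that match the most common output, with rationales compared by exact string match. The function names, the sample judge outputs, and the whitespace-based word count are all hypothetical and are not taken from the study.

```python
from collections import Counter


def modal_agreement(labels):
    """Fraction of runs matching the most common label (assumed metric)."""
    if not labels:
        raise ValueError("need at least one run")
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)


def within_word_limit(text, max_words):
    """Deterministic check: whitespace-delimited word count against a limit."""
    return len(text.split()) <= max_words


# Hypothetical judge outputs over five runs of the same evaluation.
verdicts = ["pass", "pass", "pass", "pass", "pass"]
rationales = [
    "response has 48 words",
    "response has 52 words",
    "response has 48 words",
    "response has 50 words",
    "response has 47 words",
]

# Every run agrees on the verdict, but the stated word counts differ.
verdict_stability = modal_agreement(verdicts)
reasoning_stability = modal_agreement(rationales)
```

In this toy data the verdict is identical on every run while only two of the five rationales coincide, which mirrors the pattern the abstract describes for word-count instructions. A check such as `within_word_limit` returns the same answer on every call, which is the property the closing recommendation relies on when it assigns deterministically verifiable logic to code.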
