Gaming the Metric, Not the Harm: Certifying Safety Audits against Strategic Platform Manipulation

Online-safety regulation under the UK Online Safety Act and the EU Digital Services Act increasingly treats scalar metrics as compliance evidence. Once announced, such a metric also becomes an optimization target: a strategic platform can improve its score by routing recommendations through semantically equivalent content variants, without reducing true harm. We ask when such an audit metric can still certify a genuine reduction in harm. We model the audit protocol as a published transformation graph whose connected components form semantic classes, and we treat the metric itself as a security object. Three results follow. First, any metric that scores variants directly is manipulable as soon as two equivalent variants in a harmful class disagree in score. Second, the semantic-envelope lift, which assigns each variant the maximum score in its class, is the unique pointwise minimum among conservative classwise-constant repairs. Third, a class-stratified certificate, $H^\star(x) \le (1/\hat{\alpha})\, M_{\mathrm{Env}(m)}(x) + \bar{\eta}$, holds for every platform strategy $x$, with $\bar{\eta}$ absorbing annotation and protocol error. We check the claims at three levels: exhaustive enumeration on a finite-state grid of mixed strategies, an SMT encoding in Z3 cross-replayed in cvc5, and a bounded single-player MDP encoded in PRISM-games. The fragile metric fails manipulation invariance and cannot support a comparably useful predeclared class-coverage certificate. Measured against the envelope-level certificate, it produces large violations at every tested instance, with a large mean gaming gap across random catalogs at a fixed audit budget. The semantic-envelope metric exhibits no such violation in the tested instances.
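To make the envelope construction concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes that the transformation graph is given as undirected edges between variant identifiers, that the per-variant metric $m$ is a dictionary of scores, and that the audit metric is the exposure-weighted sum $M_m(x) = \sum_v x(v)\, m(v)$. The aggregation rule, the function names, and the toy catalog are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def semantic_classes(variants, edges):
    """Map each variant to a representative of its connected component
    (its semantic class), using union-find over undirected edges."""
    parent = {v: v for v in variants}

    def find(v):
        # Path halving keeps the trees shallow.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    return {v: find(v) for v in variants}


def envelope_lift(scores, edges):
    """Env(m): give every variant the maximum score found in its class."""
    cls = semantic_classes(scores.keys(), edges)
    class_max = defaultdict(lambda: float("-inf"))
    for v, s in scores.items():
        class_max[cls[v]] = max(class_max[cls[v]], s)
    return {v: class_max[cls[v]] for v in scores}


def metric(scores, exposure):
    """ASSUMPTION: the audit metric is an exposure-weighted sum of scores."""
    return sum(weight * scores[v] for v, weight in exposure.items())


# Hypothetical toy catalog: two equivalent harmful variants that disagree
# in score, plus one unrelated benign item.
m = {"harmful": 0.9, "harmful_obfuscated": 0.1, "benign": 0.0}
graph = [("harmful", "harmful_obfuscated")]
env = envelope_lift(m, graph)

honest = {"harmful": 1.0}            # serves the high-scoring variant
gamed = {"harmful_obfuscated": 1.0}  # same class, lower raw score

print(metric(m, honest), metric(m, gamed))      # raw metric drops under gaming
print(metric(env, honest), metric(env, gamed))  # envelope metric is unchanged
```

In this toy case the raw metric falls when exposure moves to the obfuscated variant even though the platform serves the same semantic class, while the envelope metric is constant within the class and therefore does not change.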
