Third-party Large Language Model (LLM) API gateways are rapidly emerging as unified access points to models offered by multiple vendors. However, the internal routing, caching, and billing policies of these gateways are largely undisclosed, leaving users with limited visibility into whether requests are served by the advertised models, whether responses remain faithful to upstream APIs, and whether invoices accurately reflect public pricing policies. To address this gap, we introduce GateScope, a lightweight black-box measurement framework for evaluating behavioral consistency and operational transparency in commercial LLM gateways. GateScope is designed to detect key misbehaviors, including model downgrading or switching, silent truncation, billing inaccuracies, and latency instability. It audits gateways along four critical dimensions: response content analysis, multi-turn conversation performance, billing accuracy, and latency characteristics. Our measurements across 10 real-world commercial LLM API gateways reveal frequent gaps between expected and actual behaviors, including silent model substitutions, degraded memory retention, deviations from announced pricing, and substantial variation in latency stability across platforms.
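To make the billing-accuracy dimension concrete, the following is a minimal sketch of how a single charge could be compared against public per-token pricing. It is not GateScope's implementation: the names `Usage`, `expected_cost`, and `billing_deviation`, the per-million-token price convention, and the example prices are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Token counts for one request (hypothetical structure)."""
    prompt_tokens: int
    completion_tokens: int


def expected_cost(usage: Usage, input_price: float, output_price: float) -> float:
    """Cost implied by public pricing, with prices given per one million tokens."""
    return (
        usage.prompt_tokens * input_price
        + usage.completion_tokens * output_price
    ) / 1_000_000


def billing_deviation(charged: float, usage: Usage,
                      input_price: float, output_price: float) -> float:
    """Relative deviation of the charged amount from the expected cost.

    A positive value means the gateway charged more than public pricing implies.
    """
    expected = expected_cost(usage, input_price, output_price)
    if expected == 0:
        raise ValueError("expected cost is zero; deviation is undefined")
    return (charged - expected) / expected


# Illustrative numbers only: 1000 prompt tokens and 500 completion tokens
# at assumed prices of 3.0 and 15.0 per million tokens.
usage = Usage(prompt_tokens=1000, completion_tokens=500)
deviation = billing_deviation(charged=0.0126, usage=usage,
                              input_price=3.0, output_price=15.0)
print(f"relative deviation: {deviation:+.1%}")
```

Reporting a relative rather than absolute deviation keeps results comparable across models with very different price points; an audit would presumably aggregate this quantity over many requests rather than judge a gateway on a single charge.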