When using an LLM through an API provider, users expect the served model to remain consistent over time, a property crucial for the reliability of downstream applications and the reproducibility of research. Existing audit methods are too costly to apply at regular time intervals to the wide range of available LLM APIs. This means that model updates are left largely unmonitored in practice. In this work, we show that while LLM log probabilities (logprobs) are usually non-deterministic, they can still be used as the basis for cost-effective continuous monitoring of LLM APIs. We apply a simple statistical test based on the average value of each token logprob, requesting only a single token of output. This is enough to detect changes as small as one step of fine-tuning, making this approach more sensitive than existing methods while being 1,000x cheaper. We introduce the TinyChange benchmark as a way to measure the sensitivity of audit methods in the context of small, realistic model changes.
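Below is a minimal sketch of the monitoring idea the abstract describes, not the paper's exact procedure. Everything beyond the abstract is an assumption: an OpenAI-compatible chat completions endpoint, "gpt-4o-mini" as a placeholder model name, a small illustrative prompt set, and Welch's t-test standing in for the paper's statistical test on mean token logprobs.

```python
# Sketch only: assumed OpenAI-compatible API, illustrative prompts,
# and Welch's t-test as a stand-in for the paper's statistical test.
import numpy as np
from scipy.stats import ttest_ind
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical probe prompts; the actual prompt set is an assumption.
PROMPTS = ["The capital of France is", "2 + 2 =", "Water boils at"]

def first_token_logprob(prompt: str, model: str) -> float:
    """Request a single output token and return its logprob."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,    # one token of output is enough, as in the abstract
        logprobs=True,   # ask the API to return the sampled token's logprob
        temperature=0.0,
    )
    return resp.choices[0].logprobs.content[0].logprob

def collect(model: str, n_samples: int = 30) -> np.ndarray:
    """Sample logprobs repeatedly; values vary run-to-run because
    serving is non-deterministic, hence the statistical test."""
    return np.array([
        first_token_logprob(p, model)
        for _ in range(n_samples)
        for p in PROMPTS
    ])

# Phase 1: record a reference distribution while the model is trusted.
reference = collect("gpt-4o-mini")

# Phase 2 (later): sample again and test whether the mean logprob moved.
current = collect("gpt-4o-mini")
stat, p_value = ttest_ind(reference, current, equal_var=False)  # Welch's t-test
if p_value < 0.01:
    print(f"Model likely changed (p={p_value:.2e})")
else:
    print(f"No change detected (p={p_value:.2e})")
```

Because each probe requests only one output token, a monitoring run costs a few hundred single-token calls, which is what makes the scheme cheap enough to repeat at regular intervals.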