Reliable evaluation of agentic systems requires unbiased estimates with valid uncertainty, but standard practice is caught between costly human annotation and biased LLM-as-judge proxies. Prediction-powered inference (PPI) combines the two into debiased estimates with valid confidence intervals, yet its methods remain scattered across papers, with only partial implementations. We introduce GLIDE, an open-source Python library that unifies state-of-the-art PPI estimators (PPI++, Stratified PPI, Predict-Then-Debias and its stratified variants, Active Statistical Inference) and samplers (uniform, stratified, active, cost-optimal) under a scipy-style API specialized to mean estimation. GLIDE ships with a reproducible Monte Carlo validation suite, an empirically grounded decision tree for method selection, and an agentic evaluation case study showing substantial annotation savings at equivalent precision. GLIDE is available at https://github.com/EmertonData/glide
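To make the debiasing idea concrete, the following is a minimal sketch of the classic PPI mean estimator, written with the Python standard library only. It does not use or reproduce GLIDE's API: the function name `ppi_mean`, the helper `noisy_judge`, and the synthetic judge error rates are all assumptions chosen for illustration. The estimate is the judge's mean on a large unlabeled pool plus the mean human-minus-judge residual on a small labeled set, with a normal-approximation confidence interval.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def ppi_mean(y_labeled, judge_labeled, judge_unlabeled, alpha=0.05):
    """Classic PPI mean estimate with a normal-approximation interval.

    y_labeled:       human scores on the n labeled items
    judge_labeled:   judge scores on the same n items
    judge_unlabeled: judge scores on N additional unlabeled items
    (Hypothetical helper for illustration; not part of GLIDE.)
    """
    n, pool_size = len(y_labeled), len(judge_unlabeled)
    # Rectifier: per-item judge error, measured on the labeled set.
    residuals = [y - f for y, f in zip(y_labeled, judge_labeled)]
    estimate = fmean(judge_unlabeled) + fmean(residuals)
    # The labeled and unlabeled samples are independent, so variances add.
    std_err = math.sqrt(
        variance(judge_unlabeled) / pool_size + variance(residuals) / n
    )
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err)


def noisy_judge(y, rng):
    """Assumed judge: keeps a success with prob. 0.95, passes a failure with prob. 0.30."""
    if y == 1:
        return 1 if rng.random() < 0.95 else 0
    return 1 if rng.random() < 0.30 else 0


# Synthetic data: true success rate 0.70, judge biased upward.
rng = random.Random(0)
y_lab = [1 if rng.random() < 0.70 else 0 for _ in range(200)]
f_lab = [noisy_judge(y, rng) for y in y_lab]
y_pool = [1 if rng.random() < 0.70 else 0 for _ in range(5000)]  # never observed
f_pool = [noisy_judge(y, rng) for y in y_pool]

estimate, (lower, upper) = ppi_mean(y_lab, f_lab, f_pool)
print(f"judge-only mean: {fmean(f_pool):.3f}")
print(f"PPI estimate:    {estimate:.3f}  (95% CI {lower:.3f} to {upper:.3f})")
```

In this setup the judge-only mean is pulled toward the judge's bias, while the PPI estimate is corrected by the labeled residuals. The interval narrows as the judge agrees more closely with the human labels, which is where the annotation savings come from.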