Accurate evaluation of user satisfaction is critical for the iterative development of conversational AI. However, for open-ended assistants, traditional A/B testing lacks reliable metrics: explicit feedback is sparse, while implicit metrics are ambiguous. To bridge this gap, we introduce BoRP (Bootstrapped Regression Probing), a scalable framework for high-fidelity satisfaction evaluation. Unlike generative approaches, BoRP leverages the geometric properties of the LLM latent space. It employs a polarization-index-based bootstrapping mechanism to automate rubric generation and uses Partial Least Squares (PLS) to map hidden states to continuous scores. Experiments on industrial datasets show that BoRP (Qwen3-8B/14B) significantly outperforms generative baselines (even Qwen3-Max) in alignment with human judgments. Furthermore, BoRP reduces inference costs by orders of magnitude, enabling full-scale monitoring and highly sensitive A/B testing via CUPED.
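The regression-probing idea in the abstract can be sketched as follows. This is a minimal, self-contained illustration, not BoRP's implementation: the "hidden states" are synthetic NumPy vectors, the rubric-bootstrapping stage is omitted, and the single-component PLS1 (NIPALS-style) fit below is an assumed simplification of the multi-component PLS the paper uses.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 500

# Synthetic stand-ins for LLM hidden states: satisfaction is a smooth
# monotone function of the projection onto one latent direction.
direction = rng.normal(size=d)
H = rng.normal(size=(n, d))
y = 1 / (1 + np.exp(-H @ direction / np.sqrt(d)))

# Split, center, and extract one PLS component (PLS1):
Xtr, ytr, Xte, yte = H[:400], y[:400], H[400:], y[400:]
xm, ym = Xtr.mean(axis=0), ytr.mean()
Xc, yc = Xtr - xm, ytr - ym

w = Xc.T @ yc
w /= np.linalg.norm(w)          # weight vector: direction of max covariance with y
t = Xc @ w                      # sample scores along that component
q = (yc @ t) / (t @ t)          # regress y on the component scores

# Map held-out hidden states to continuous satisfaction scores.
pred = (Xte - xm) @ w * q + ym
corr = np.corrcoef(pred, yte)[0, 1]
print(corr > 0.8)
```

Because the probe is a single linear map learned once, scoring a new conversation costs one dot product per hidden state, which is the source of the inference-cost savings the abstract claims over generative judging.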