Measuring the quality of responses generated by LLMs is challenging, particularly when evaluating whether a response is aligned with human preference. A novel approach uses the LLM itself as the evaluator and stabilizes the results through multiple independent evaluations, which is similar to a single-layer, narrow LLM network. Such a network consists of a fixed number of neurons, each of which is the same LLM. In this paper, we draw upon the extensive research on deep neural networks to explore whether deeper and wider networks can lead to fairer evaluations. Specifically, inspired by the observation that different neurons in a neural network are responsible for detecting different concepts, we first adaptively generate as many neuron roles as possible for each evaluation sample. Each role represents one evaluation perspective and is assigned to a specific LLM neuron in the first layer. For subsequent layers, we follow the idea that higher layers in deep networks are responsible for more comprehensive features: each layer receives the representations from all neurons in the previous layer and integrates the locally learned evaluation information to obtain a more comprehensive evaluation result. Interestingly, this network design resembles the process of academic paper reviewing. To validate the effectiveness of our method, we construct LLMEval$^2$, the largest and most diverse English evaluation benchmark for LLM evaluators, comprising 15 tasks, 8 abilities, and 2,553 samples. Experimental results demonstrate that a wider network (involving many reviewers) with 2 layers (one round of discussion) performs best, improving the kappa correlation coefficient from 0.28 to 0.34. We also leverage our method, WideDeep, to aid in the assessment of Chinese LLMs, which speeds up evaluation by 4.6 times and saves 60% of the cost. WideDeep achieves a 93% agreement level among humans.
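The layered design described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the paper: the `Review` record, the `toy_neuron` stand-in (a fixed rule in place of a real LLM call), the role names, and the sign-of-sum vote used to produce the final verdict. The sketch only shows the data flow: first-layer neurons judge independently from their own role, and every later layer receives the reviews of all neurons in the previous layer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Review:
    role: str
    preference: int  # +1: prefers answer A, -1: prefers answer B, 0: tie
    rationale: str


# Hypothetical neuron signature:
# (role, question, answer_a, answer_b, previous_layer_reviews) -> Review
Neuron = Callable[[str, str, str, str, List[Review]], Review]


def toy_neuron(role, question, answer_a, answer_b, previous):
    """Stand-in for an LLM call; a toy rule so the sketch runs offline."""
    if previous:
        # Deeper layers: integrate the votes of every neuron in the previous layer.
        total = sum(r.preference for r in previous)
        pref = (total > 0) - (total < 0)
        return Review(role, pref, f"{role}: integrated {len(previous)} reviews")
    # First layer: judge from this neuron's own perspective only.
    if role == "conciseness":
        pref = (len(answer_a) < len(answer_b)) - (len(answer_a) > len(answer_b))
    elif role == "completeness":
        pref = (len(answer_a) > len(answer_b)) - (len(answer_a) < len(answer_b))
    else:
        pref = 0  # this toy rule has no opinion for other roles
    return Review(role, pref, f"{role}: independent judgment")


def evaluate(roles, question, answer_a, answer_b, neuron: Neuron = toy_neuron, depth=2):
    """Run a network of width len(roles) and the given depth.

    Returns (verdict, reviews of the last layer), where verdict is the sign
    of the summed preferences (an assumed aggregation rule).
    """
    layer: List[Review] = []
    for _ in range(depth):
        # Each neuron sees all reviews from the previous layer (none in layer 1).
        layer = [neuron(role, question, answer_a, answer_b, layer) for role in roles]
    total = sum(r.preference for r in layer)
    return (total > 0) - (total < 0), layer


if __name__ == "__main__":
    roles = ["conciseness", "accuracy", "fluency"]
    verdict, reviews = evaluate(roles, "What is 2 + 2?", "4", "The answer is four.")
    print(verdict, [r.preference for r in reviews])
```

In this sketch, width is the number of roles and `depth=2` corresponds to one round of discussion. With `depth=1` the function reduces to independent evaluations combined by a vote, which is the single-layer setting the abstract starts from.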