Evaluation plays a pivotal role in improving the capabilities of large language models, and numerous evaluation methods have been proposed. Despite their effectiveness, existing works focus mainly on objective questions and overlook the evaluation of subjective questions, which are extremely common for large language models. In addition, these methods predominantly rely on centralized datasets, with question banks concentrated within the evaluation platforms themselves. Moreover, the evaluation processes of these platforms often ignore personalized factors, failing to account for the individual characteristics of both the evaluators and the models being evaluated. To address these limitations, we propose BingJian, a novel anonymous crowd-sourcing evaluation platform for large language models that employs a competitive scoring mechanism in which users rank models based on their performance. The platform not only supports centralized evaluation to assess the general capabilities of models but also offers an open evaluation gateway through which users can submit their own questions, testing models on a personalized and potentially broader range of capabilities. Furthermore, the platform introduces personalized evaluation scenarios, leveraging various forms of human-computer interaction to assess large language models in a manner that accounts for individual user preferences and contexts. A demonstration of BingJian is available at https://github.com/Mingyue-Cheng/Bingjian.