Evaluation is pivotal for refining Large Language Models (LLMs), pinpointing their capabilities, and guiding enhancements. The rapid development of LLMs calls for a lightweight, easy-to-use framework that enables swift evaluation deployment. However, the many implementation details involved make building a comprehensive evaluation platform far from trivial. Existing platforms are often complex and poorly modularized, hindering seamless incorporation into research workflows. This paper introduces UltraEval, a user-friendly evaluation framework characterized by its lightweight design, comprehensiveness, modularity, and efficiency. We identify and reimplement three core components of model evaluation: models, data, and metrics. The resulting composability allows different models, tasks, prompts, benchmarks, and metrics to be freely combined within a unified evaluation workflow. Additionally, UltraEval supports diverse models through a unified HTTP service and provides effective inference acceleration. UltraEval is now publicly available to researchers.
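To make the composability claim concrete, the following is a minimal sketch of how a model backend, a task's data, and a metric can be defined independently and combined in one evaluation loop. All names and signatures here are illustrative assumptions for exposition, not UltraEval's actual API.

```python
# Hypothetical sketch of the modular (model, data, metric) composition described
# in the abstract; names are illustrative, not UltraEval's real interfaces.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    prompt: str       # task prompt rendered from a template
    reference: str    # gold answer consumed by the metric


def echo_model(prompt: str) -> str:
    """Stand-in for a model served behind a unified HTTP endpoint."""
    return prompt.split()[-1]  # trivially "answers" with the last token


def exact_match(prediction: str, reference: str) -> float:
    """A simple metric; any metric with this signature can be swapped in."""
    return float(prediction.strip() == reference.strip())


def evaluate(model: Callable[[str], str],
             data: List[Sample],
             metric: Callable[[str, str], float]) -> float:
    """Unified workflow: any combination of model, data, and metric plugs in."""
    scores = [metric(model(s.prompt), s.reference) for s in data]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    task = [Sample(prompt="Answer with one word: the capital of France is Paris",
                   reference="Paris")]
    print(evaluate(echo_model, task, exact_match))  # -> 1.0
```

In this sketch, swapping the model, the dataset, or the metric requires no change to the evaluation loop itself, which is the kind of free combination the framework aims to provide.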