We introduce a novel evaluation framework for Large Language Models (LLMs) such as \textsc{Llama-2} and \textsc{Mistral}, which adapts Precision and Recall metrics from image generation to text generation. This approach enables a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora. Through a comprehensive evaluation of state-of-the-art language models, we reveal new insights into their performance on open-ended generation tasks, which are not adequately captured by traditional benchmarks. Our findings highlight a trade-off between the quality and diversity of generated samples, particularly when models are fine-tuned on instruction datasets or with human feedback. This work extends the toolkit for distribution-based NLP evaluation, offering insights into the practical capabilities and challenges that current LLMs face in generating diverse and high-quality text. We release our code and data.
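For concreteness, below is a minimal sketch of one common way such distributional Precision and Recall can be estimated from text embeddings, namely the k-nearest-neighbor estimator in the style of Kynkäänniemi et al. (2019); the function names, the choice of $k$, and the use of plain Euclidean distances are illustrative assumptions, not necessarily the exact estimator used in this work.

\begin{verbatim}
import numpy as np

def knn_radii(X, k=3):
    # Distance from each embedding in X to its k-th nearest
    # neighbor in X (excluding the point itself).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.sort(d, axis=1)[:, k - 1]

def precision_recall(real, gen, k=3):
    # real, gen: (n, dim) arrays of sentence embeddings.
    # Precision: fraction of generated embeddings falling inside
    # the k-NN ball of at least one real embedding (quality).
    # Recall: fraction of real embeddings falling inside the k-NN
    # ball of at least one generated embedding (diversity).
    r_real = knn_radii(real, k)
    r_gen = knn_radii(gen, k)
    d = np.linalg.norm(gen[:, None, :] - real[None, :, :], axis=-1)
    precision = float((d <= r_real[None, :]).any(axis=1).mean())
    recall = float((d.T <= r_gen[None, :]).any(axis=1).mean())
    return precision, recall
\end{verbatim}

Under this estimator, a model that produces fluent but repetitive text scores high Precision and low Recall, while a model covering the reference distribution at the cost of fluency shows the opposite pattern, which is the quality-diversity trade-off discussed above.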