This paper introduces a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral that adapts the Precision and Recall metrics from image generation to text generation. The approach assesses both the quality and the diversity of generated text without requiring aligned corpora. A comprehensive evaluation of state-of-the-art language models reveals aspects of their performance on open-ended generation tasks that traditional benchmarks do not adequately capture. The findings highlight a trade-off between the quality and diversity of generated samples, particularly when models are fine-tuned with human feedback. This work extends the toolkit for distribution-based NLP evaluation and clarifies the practical capabilities and challenges of current LLMs in generating diverse, high-quality text.
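To make the idea of distribution-level Precision and Recall concrete, the following is a minimal sketch of one estimator commonly used in the image-generation literature: a k-nearest-neighbour support estimate computed over embedding vectors. The abstract does not say which estimator or which text embedding the paper uses, so both are assumptions here. The function names (`pairwise_distances`, `knn_radii`, `coverage`, `precision_recall`) are hypothetical, and the inputs are assumed to be pre-computed embeddings of reference and generated texts. Precision is read as the share of generated samples that fall within the estimated support of the reference data (quality), and Recall as the share of reference samples that fall within the estimated support of the generated data (diversity). No aligned pairs are needed, only two unpaired sets.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distance between every row of `a` and every row of `b`."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    d = pairwise_distances(points, points)
    # After sorting, column 0 is the point itself (distance 0),
    # so column k holds the k-th nearest neighbour. Requires k < len(points).
    return np.sort(d, axis=1)[:, k]


def coverage(queries, support, k):
    """Fraction of `queries` lying inside at least one k-NN ball of `support`."""
    radii = knn_radii(support, k)
    d = pairwise_distances(queries, support)
    return float((d <= radii[None, :]).any(axis=1).mean())


def precision_recall(ref_emb, gen_emb, k=3):
    """Assumed k-NN support estimator; not necessarily the paper's exact method."""
    precision = coverage(gen_emb, ref_emb, k)  # generated samples near real data
    recall = coverage(ref_emb, gen_emb, k)     # real data reachable by the model
    return precision, recall


if __name__ == "__main__":
    # Synthetic stand-ins for text embeddings (illustrative data only).
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(200, 8))
    # A "collapsed" generator: samples concentrated in a small part of the space.
    generated = 0.2 * rng.normal(size=(200, 8))
    print(precision_recall(reference, generated, k=3))
```

In this toy setup the generated set is concentrated in a small region of the reference distribution, which is the kind of situation where one would expect Precision to stay high while Recall drops. This mirrors the quality and diversity trade-off the abstract describes, although the actual numbers depend on the embedding model and on `k`.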