This paper introduces a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral, adapting the Precision and Recall metrics from image generation to text generation. The approach allows a nuanced assessment of the quality and diversity of generated text without requiring aligned corpora. A comprehensive evaluation of state-of-the-art language models reveals aspects of their performance on open-ended generation tasks that traditional benchmarks do not adequately capture. The findings highlight a trade-off between the quality and diversity of generated samples, particularly when models are fine-tuned with human feedback. This work extends the toolkit for distribution-based NLP evaluation and clarifies the practical capabilities and challenges of current LLMs in generating diverse, high-quality text.
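To make the idea of distribution-based Precision and Recall concrete, the following is a minimal sketch of one common formulation from the image-generation literature, in which the support of each distribution is approximated by k-nearest-neighbour balls around embedded samples. It is an illustration under stated assumptions, not necessarily the estimator used in this paper: the function names, the choice of `k`, and the use of precomputed fixed-size text embeddings are all hypothetical. Precision is the fraction of generated samples that fall inside the estimated support of the reference data (quality), and Recall is the fraction of reference samples that fall inside the estimated support of the generated data (diversity). Neither quantity requires pairing a generated text with a specific reference text.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    d = pairwise_distances(points, points)
    # After sorting, column 0 is the distance of a point to itself (zero),
    # so column k holds the distance to the k-th other point.
    return np.sort(d, axis=1)[:, k]


def coverage(queries, support, k):
    """Fraction of queries lying inside at least one k-NN ball of `support`."""
    radii = knn_radii(support, k)
    d = pairwise_distances(queries, support)
    return float((d <= radii[None, :]).any(axis=1).mean())


def precision_recall(real, generated, k=3):
    """Assumed inputs: two arrays of embeddings, shape (n_samples, dim)."""
    precision = coverage(generated, real, k)  # quality of generated samples
    recall = coverage(real, generated, k)     # diversity of generated samples
    return precision, recall


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for embeddings of reference texts and model outputs.
    real = rng.normal(size=(200, 8))
    # A "collapsed" model that only reproduces near-copies of five references.
    collapsed = np.repeat(real[:5], 10, axis=0) + 1e-3 * rng.normal(size=(50, 8))
    print(precision_recall(real, collapsed, k=3))
```

In the toy example, the collapsed model scores high Precision but very low Recall, which mirrors the quality versus diversity trade-off described above. In practice the embeddings would come from a text encoder, and the result is sensitive to both the encoder and the value of `k`.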