Large Language Models (LLMs) have achieved significant success across a range of natural language processing (NLP) tasks, including question answering, summarization, and machine translation. While LLMs excel at general tasks, their efficacy in domain-specific applications has yet to be fully explored. In addition, LLM-generated text sometimes exhibits problems such as hallucination and disinformation. In this study, we assess the ability of LLMs to produce concise survey articles in the NLP area of computer science, focusing on 20 selected topics. Automated evaluations indicate that GPT-4 outperforms GPT-3.5 when benchmarked against the ground truth. Furthermore, four human evaluators provide insights from six perspectives across four model configurations. Through case studies, we demonstrate that while GPT often yields commendable results, it sometimes falls short, for example by giving incomplete information or making factual errors.