Large language models (LLMs), such as ChatGPT and GPT-4, are gaining widespread real-world use. Yet, the two LLMs are closed-source, and little is known about their performance in real-world use cases. In academia, LLM performance is often measured on benchmarks that may have leaked into ChatGPT's and GPT-4's training data. In this paper, we apply and evaluate ChatGPT and GPT-4 for the real-world task of cost-efficient extractive question answering over a text corpus that was published after the two LLMs completed training. More specifically, we extract research challenges for researchers in the field of HCI from the proceedings of the 2023 Conference on Human Factors in Computing Systems (CHI). We critically evaluate the LLMs on this practical task and conclude that the combination of ChatGPT and GPT-4 is an excellent, cost-efficient means of analyzing a text corpus at scale. Cost-efficiency is key for prototyping research ideas and analyzing text corpora from different perspectives, with implications for applying LLMs in academia and practice. For researchers in HCI, we contribute an interactive visualization of 4392 research challenges in over 90 research topics. We share this visualization and the dataset in the spirit of open science.