There is increasing interest in the adoption of LLMs in HCI research. However, because of their powerful capabilities, LLMs are often treated as a panacea, with little scrutiny of whether they are actually suitable for their intended tasks. We contend that LLMs should be adopted critically, following rigorous evaluation. Accordingly, we present an evaluation of an LLM at identifying logical fallacies, a task that will form part of a digital misinformation intervention. Comparing its output against a labeled dataset, we found that GPT-4 achieves an accuracy of 0.79 and, for our intended use case that excludes invalid or unidentified instances, an accuracy of 0.90. This gives us the confidence to proceed with applying the LLM while remaining mindful of the areas where it still falls short. The paper describes our evaluation approach, results, and reflections on the use of the LLM for our intended task.
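To make the two reported figures concrete, the following is a minimal sketch of how an overall accuracy and a use-case accuracy (excluding instances the model marked invalid or could not identify) might be computed. The fallacy labels, list structure, and the `EXCLUDED` set are illustrative assumptions, not the paper's actual data or code.

```python
# Hypothetical predictions and gold labels for illustration only.
predictions = ["ad hominem", "invalid", "straw man", "unidentified", "false dilemma"]
gold        = ["ad hominem", "slippery slope", "straw man", "red herring", "false dilemma"]

# Assumed sentinel labels for instances the model declines to classify.
EXCLUDED = {"invalid", "unidentified"}

# Overall accuracy: every instance counts, including excluded ones.
overall = sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Use-case accuracy: drop invalid/unidentified instances before scoring.
kept = [(p, g) for p, g in zip(predictions, gold) if p not in EXCLUDED]
use_case = sum(p == g for p, g in kept) / len(kept)

print(f"overall accuracy:  {overall:.2f}")   # 0.60 on this toy sample
print(f"use-case accuracy: {use_case:.2f}")  # 1.00 on this toy sample
```

On this toy sample the use-case accuracy exceeds the overall accuracy for the same reason as in the reported results: the denominator shrinks to only the instances the model actually attempted to classify.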