The Accuracy of Domain Specific and Descriptive Analysis Generated by Large Language Models

Large language models (LLMs) have attracted considerable attention for their impressive ability to generate high-quality responses to human inputs. LLMs can compose not only textual scripts such as emails and essays but also executable programming code. In contrast, their automated reasoning capability in statistically driven descriptive analysis has not yet been fully explored, particularly on user-specific data and when they act as personal assistants to users with limited background knowledge in an application domain who would like to carry out basic as well as advanced statistical and domain-specific analysis. More importantly, the performance of these LLMs on domain-specific data analysis tasks has not been compared and discussed in detail. This study therefore explores whether LLMs can serve as generative AI-based personal assistants that help users with minimal background knowledge in an application domain infer key data insights. To demonstrate the performance of the LLMs, the study reports a case study in which descriptive statistical analysis and Natural Language Processing (NLP) based investigations are performed on a number of phishing emails, with the objective of comparing the accuracy of the results generated by LLMs with those produced by analysts. The experimental results show that LangChain and the Generative Pre-trained Transformer (GPT-4) excel in numerical reasoning tasks, i.e., temporal statistical analysis, and achieve competitive correlation with human judgments on feature engineering tasks, while struggling to some extent on reasoning tasks that require domain-specific knowledge.
