An intriguing case is the quality of the machine learning and deep learning models that these LLMs generate for automated scientific data analysis, where a data analyst may lack the expertise to manually code and optimize complex deep learning models and may therefore opt to use LLMs to generate the required models. This paper investigates and compares the performance of mainstream LLMs, such as ChatGPT, PaLM, LLama, and Falcon, in generating deep learning models for analyzing time series data, an important and popular data type with applications in many domains, including finance and the stock market. This research conducts a set of controlled experiments in which the prompts for generating deep learning-based models are controlled with respect to the sensitivity levels of four criteria: 1) Clarity and Specificity, 2) Objective and Intent, 3) Contextual Information, and 4) Format and Style. While the results are relatively mixed, we observe some distinct patterns. Using LLMs, we are able to generate deep learning-based models with executable code for each dataset separately, whose performance is comparable to that of manually crafted and optimized LSTM models for predicting the whole time series dataset. We also observe that ChatGPT outperforms the other LLMs in generating more accurate models. Furthermore, the quality of the generated models varies with the ``temperature'' parameter used in configuring the LLMs. The results can benefit data analysts and practitioners who would like to leverage generative AI to produce prediction models of acceptable quality.
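To illustrate why the temperature setting can change what an LLM generates, the following is a minimal sketch of the common softmax-with-temperature formulation. It is an assumption for illustration only: the abstract does not describe how each evaluated LLM applies temperature internally, and the function name and example scores below are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into probabilities, scaled by a temperature.

    Assumed formulation: p_i = exp(z_i / T) / sum_j exp(z_j / T).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens (not from the paper).
example_logits = [2.0, 1.0, 0.1]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(example_logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Under this formulation, a low temperature concentrates probability on the top-scoring token, so repeated generations tend to be similar, while a higher temperature flattens the distribution and makes the generated code more varied. That is one plausible reason the quality of generated models could depend on this setting.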