Large language models (LLMs) have recently expanded into various domains, yet there remains a need to evaluate how these models handle commonplace queries relative to domain-specific ones, which can serve as a benchmark before fine-tuning on domain-specific downstream tasks. This study evaluates two LLMs, Gemma-2B and Gemma-7B, on queries from diverse domains, including cybersecurity, medicine, and finance, against common-knowledge queries. It employs a comprehensive methodology for evaluating foundation models, encompassing problem formulation, data analysis, and the development of novel outlier detection techniques; this methodological rigor strengthens the credibility of the presented evaluation framework. The study assesses inference time, response length, throughput, response quality, and resource utilization, and investigates the correlations among these factors. The results indicate that model size and the type of prompt used at inference significantly influence response length and quality. Common prompts, which span varied query types, elicit diverse and inconsistent responses with irregular latencies, whereas domain-specific prompts consistently yield concise responses within a reasonable time. Overall, this study underscores the need for comprehensive evaluation frameworks to improve the reliability of benchmarking procedures in multidomain AI research.
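To make the measured quantities concrete, the following is a minimal sketch of this kind of measurement loop, assuming the Hugging Face transformers checkpoints google/gemma-2b and google/gemma-7b. The prompts, generation settings, and the 1.5×IQR outlier rule below are illustrative assumptions, not the study's actual configuration.

```python
import time

import numpy as np
from scipy.stats import pearsonr
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-2b"  # swap in "google/gemma-7b" to compare model sizes

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# Illustrative prompts only: one common-knowledge query plus one per domain.
prompts = [
    "Why is the sky blue?",                                  # common knowledge
    "How does a SQL injection attack work?",                 # cybersecurity
    "What are the first-line treatments for hypertension?",  # medicine
    "How is the price-to-earnings ratio interpreted?",       # finance
]

latencies, lengths = [], []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=256)
    elapsed = time.perf_counter() - start

    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    latencies.append(elapsed)
    lengths.append(new_tokens)
    # Throughput here is generated tokens per second of wall-clock time.
    print(f"{elapsed:6.2f} s | {new_tokens:4d} tokens | "
          f"{new_tokens / elapsed:5.1f} tokens/s")

# Flag latency outliers with a standard 1.5*IQR rule; the study's own
# outlier-detection technique is not specified here and may differ.
q1, q3 = np.percentile(latencies, [25, 75])
fence = 1.5 * (q3 - q1)
outliers = [t for t in latencies if t < q1 - fence or t > q3 + fence]

# Correlation between response length and inference time.
r, p = pearsonr(lengths, latencies)
print(f"outliers: {outliers}, length-latency correlation r={r:.2f} (p={p:.2f})")
```

In practice such a loop would be repeated over many prompts per category so that the latency distribution, outlier counts, and correlations are computed per prompt type rather than over a handful of queries.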