Large language models (LLMs) have recently expanded into various domains, yet there remains a need to evaluate how these models handle commonplace queries relative to domain-specific ones, which can serve as a benchmark before fine-tuning on domain-specific downstream tasks. This study evaluates two LLMs, Gemma-2B and Gemma-7B, on queries from diverse domains, including cybersecurity, medicine, and finance, against common-knowledge queries. It employs a comprehensive methodology for evaluating foundation models, encompassing problem formulation, data analysis, and the development of novel outlier detection techniques; this methodological rigor strengthens the credibility of the presented evaluation framework. The study assesses inference time, response length, throughput, response quality, and resource utilization, and investigates the correlations among these factors. The results indicate that model size and the type of prompt used at inference significantly influence response length and quality. Common prompts, which span varied query types, elicit diverse and inconsistent responses with irregular latencies, whereas domain-specific prompts consistently yield concise responses within a reasonable time. Overall, this study underscores the need for comprehensive evaluation frameworks to improve the reliability of benchmarking procedures in multidomain AI research.
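To make the measured quantities concrete, the following is a minimal sketch of this kind of measurement loop, assuming the Hugging Face transformers checkpoints google/gemma-2b and google/gemma-7b. The prompts, generation settings, and the 1.5×IQR outlier rule below are illustrative assumptions, not the study's actual configuration.

```python
import time

import numpy as np
from scipy.stats import pearsonr
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-2b"  # swap in "google/gemma-7b" to compare model sizes

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# Illustrative prompts only: one common-knowledge query plus one per domain.
prompts = [
    "Why is the sky blue?",                                  # common knowledge
    "How does a SQL injection attack work?",                 # cybersecurity
    "What are the first-line treatments for hypertension?",  # medicine
    "How is the price-to-earnings ratio interpreted?",       # finance
]

latencies, lengths = [], []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=256)
    elapsed = time.perf_counter() - start

    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    latencies.append(elapsed)
    lengths.append(new_tokens)
    # Throughput here is generated tokens per second of wall-clock time.
    print(f"{elapsed:6.2f} s | {new_tokens:4d} tokens | "
          f"{new_tokens / elapsed:5.1f} tokens/s")

# Flag latency outliers with a standard 1.5*IQR rule; the study's own
# outlier-detection technique is not specified here and may differ.
q1, q3 = np.percentile(latencies, [25, 75])
fence = 1.5 * (q3 - q1)
outliers = [t for t in latencies if t < q1 - fence or t > q3 + fence]

# Correlation between response length and inference time.
r, p = pearsonr(lengths, latencies)
print(f"outliers: {outliers}, length-latency correlation r={r:.2f} (p={p:.2f})")
```

In practice such a loop would be repeated over many prompts per category so that the latency distribution, outlier counts, and correlations are computed per prompt type rather than over a handful of queries.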