Evaluating advanced generative large language models (LLMs) is a significant challenge, given their growing complexity. Evaluating LLM-based applications in specific industries against Key Performance Indicators (KPIs) is likewise complex, because it requires a deep understanding of industry use cases and of the expected system behavior. In the automotive industry, existing evaluation metrics are inadequate for assessing in-car conversational question answering (ConvQA) systems: answers may bear on driver or vehicle safety and must stay within the car domain, and current metrics do not capture these requirements. To address these challenges, this paper introduces a set of KPIs tailored to evaluating in-car ConvQA systems, along with datasets specifically designed for these KPIs. A preliminary yet comprehensive empirical evaluation supports the efficacy of the proposed approach. Furthermore, we investigate the impact of using varied personas in prompts and find that doing so enhances the model's capacity to simulate diverse viewpoints in assessments, mirroring how individuals with different backgrounds perceive a topic.