EventChat: Implementation and user-centric evaluation of a large language model-driven conversational recommender system for exploring leisure events in an SME context

Hannes Kunstmann, Joseph Ollier, Joel Persson, Florian von Wangenheim

From arXiv, just-accepted version

Large language models (LLMs) represent an enormous evolution in the strategic potential of conversational recommender systems (CRS). Yet to date, research has predominantly focused on technical frameworks for implementing LLM-driven CRS rather than on end-user evaluations or strategic implications for firms, particularly from the perspective of small and medium-sized enterprises (SMEs), which make up the bedrock of the global economy. In the current paper, we detail the design of an LLM-driven CRS in an SME setting and its subsequent performance in the field, using both objective system metrics and subjective user evaluations. While doing so, we additionally outline a short-form revised ResQue model for evaluating LLM-driven CRS, enabling replicability in a rapidly evolving field. Our results reveal good system performance from a user experience perspective (85.5% recommendation accuracy) but underscore latency, cost, and quality issues that challenge business viability. Notably, with a median cost of $0.04 per interaction and a latency of 5.7 s, cost-effectiveness and response time emerge as crucial areas for achieving a more user-friendly and economically viable LLM-driven CRS for SME settings. One major driver of these costs is the use of an advanced LLM as a ranker within the retrieval-augmented generation (RAG) technique. Our results additionally indicate that relying solely on approaches such as prompt-based learning with ChatGPT as the underlying LLM makes it challenging to achieve satisfying quality in a production environment. Strategic considerations for SMEs deploying an LLM-driven CRS are outlined, particularly considering trade-offs in the current technical landscape.
