UQABench: Evaluating User Embedding for Prompting LLMs in Personalized Question Answering

Large language models (LLMs) have achieved remarkable success in natural language processing (NLP). In practical scenarios such as recommendation, users increasingly seek personalized experiences, so it becomes crucial to incorporate user interaction history into the LLM context. However, interaction histories are long and noisy, which makes them difficult to use directly as text prompts. A promising solution is to compress and distill interactions into compact embeddings that serve as soft prompts, helping LLMs generate personalized responses. Although this approach improves efficiency, it raises a critical question: can user embeddings adequately capture valuable information and prompt LLMs? To address this question, we propose UQABench, a benchmark designed to evaluate the effectiveness of user embeddings in prompting LLMs for personalization. We establish a fair and standardized evaluation process comprising pre-training, fine-tuning, and evaluation stages. To evaluate user embeddings thoroughly, we design tasks along three dimensions: sequence understanding, action prediction, and interest perception. These tasks cover both the industry's demands in traditional recommendation, such as improving prediction accuracy, and its aspirations for LLM-based methods, such as accurately understanding user interests and enhancing the user experience. We conduct extensive experiments on various state-of-the-art methods for modeling user embeddings. Additionally, we reveal the scaling laws of leveraging user embeddings to prompt LLMs. The benchmark is available online.
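
To make the soft-prompt idea concrete, the following is a minimal sketch, not the benchmark's actual pipeline. It assumes a mean-pooling stand-in for the user encoder and a single linear adapter; the names `mean_pool`, `linear`, `build_soft_prompt_input`, and the dimensions `D_USER` and `D_LLM` are hypothetical and chosen only for illustration. The point it shows is that a long interaction history can occupy a single position in the LLM input instead of many text tokens.

```python
import random


def mean_pool(vectors):
    """Compress a sequence of interaction vectors into one vector.

    Assumption: mean pooling stands in for a learned user encoder.
    """
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def linear(vec, weight):
    """Project vec (length d_in) with weight (d_out rows, d_in columns)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def build_soft_prompt_input(interactions, adapter_weight, question_token_embeddings):
    """Prepend the projected user embedding to the question's token embeddings."""
    user_embedding = mean_pool(interactions)
    soft_prompt = linear(user_embedding, adapter_weight)
    return [soft_prompt] + question_token_embeddings


random.seed(0)
D_USER, D_LLM = 4, 6  # hypothetical embedding sizes

# 50 past interactions, each already encoded as a D_USER-dimensional vector.
interactions = [[random.random() for _ in range(D_USER)] for _ in range(50)]
# Hypothetical adapter mapping the user space into the LLM embedding space.
adapter_weight = [[random.gauss(0, 0.1) for _ in range(D_USER)] for _ in range(D_LLM)]
# Token embeddings of a 3-token question, in the LLM embedding space.
question = [[random.random() for _ in range(D_LLM)] for _ in range(3)]

llm_input = build_soft_prompt_input(interactions, adapter_weight, question)
print(len(llm_input), len(llm_input[0]))  # 4 6
```

In this sketch the 50 interactions add only one position to the LLM input. Whether that one vector retains enough information to answer personalized questions is the concern the benchmark is designed to measure.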
