Large language models (LLMs) enable powerful zero-shot recommendations by leveraging broad contextual knowledge, yet predictive uncertainty and embedded biases threaten their reliability and fairness. This paper studies how uncertainty and fairness evaluations affect the accuracy, consistency, and trustworthiness of LLM-generated recommendations. We introduce a benchmark of curated metrics and a dataset annotated for eight demographic attributes (31 categorical values) across two domains: movies and music. Through in-depth case studies, we quantify predictive uncertainty (via entropy) and demonstrate that Google DeepMind's Gemini 1.5 Flash exhibits systematic unfairness for certain sensitive attributes; the measured similarity-based gaps are SNSR = 0.1363 and SNSV = 0.0507. These disparities persist under prompt perturbations such as typographical errors and multilingual inputs. We further integrate personality-aware fairness into the evaluation pipeline for LLM-based recommenders (RecLLMs) to reveal personality-linked bias patterns and expose trade-offs between personalization and group fairness. In summary, we propose a novel uncertainty-aware evaluation methodology for RecLLMs, present empirical insights from deep uncertainty case studies, and introduce a personality profile-informed fairness benchmark that advances explainability and equity in LLM recommendations. Together, these contributions establish a foundation for safer, more interpretable RecLLMs and motivate future work on multi-model benchmarks and adaptive calibration for trustworthy deployment.
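As a rough illustration of the quantities named in the abstract, the following minimal sketch computes an entropy-based uncertainty score and two similarity-based gap statistics. It rests on several assumptions not stated in the abstract: uncertainty is taken as the Shannon entropy of the empirical distribution over repeated model outputs; similarity between a neutral-prompt list and a sensitive-attribute-prompt list is measured with Jaccard overlap; and SNSR and SNSV are taken to be the range and the population standard deviation of those similarities across attribute values, following definitions used in earlier RecLLM fairness work. The function names (`predictive_entropy`, `jaccard`, `similarity_gaps`) are hypothetical and are not the paper's implementation.

```python
import math
from collections import Counter
from statistics import pstdev


def predictive_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution over sampled outputs."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def jaccard(a, b):
    """Jaccard overlap between two recommendation lists (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def similarity_gaps(neutral, by_group):
    """Return (SNSR, SNSV, per-group similarities) under the assumed definitions.

    neutral:  recommendations for a prompt with no sensitive attribute
    by_group: maps each attribute value to the recommendations returned
              when that value is mentioned in the prompt
    """
    sims = {group: jaccard(neutral, recs) for group, recs in by_group.items()}
    values = list(sims.values())
    snsr = max(values) - min(values)  # range of similarities across groups
    snsv = pstdev(values)             # spread of similarities across groups
    return snsr, snsv, sims


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    repeated_top1 = ["Inception", "Inception", "Heat", "Alien"]
    print("entropy (bits):", predictive_entropy(repeated_top1))

    neutral = ["A", "B", "C", "D"]
    by_group = {"group_1": ["A", "B", "C", "D"], "group_2": ["A", "B", "X", "Y"]}
    snsr, snsv, sims = similarity_gaps(neutral, by_group)
    print("SNSR:", snsr, "SNSV:", snsv, "similarities:", sims)
```

Under these assumptions, higher entropy indicates less consistent recommendations across repeated queries, while larger SNSR or SNSV values indicate that mentioning a sensitive attribute shifts the recommendations unevenly across groups.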