Large language models (LLMs) are trained on web-scale corpora that exhibit steep power-law distributions, in which knowledge is highly long-tailed and most facts appear only infrequently. While scaling has improved average-case performance, persistent failures on low-frequency, domain-specific, cultural, and temporal knowledge remain poorly characterized. This paper develops a structured taxonomy and analysis of long-tail knowledge in large language models, synthesizing prior work across technical and sociotechnical perspectives. We introduce an analytical framework organized along four complementary axes: how long-tail knowledge is defined, the mechanisms by which it is lost or distorted during training and inference, the technical interventions proposed to mitigate these failures, and the implications of these failures for fairness, accountability, transparency, and user trust. We further examine how existing evaluation practices obscure tail behavior and complicate accountability for rare but consequential failures. The paper concludes by identifying open challenges related to privacy, sustainability, and governance that constrain long-tail knowledge representation. Taken together, this work provides a unifying conceptual framework for understanding how long-tail knowledge is defined, lost, evaluated, and manifested in deployed language model systems.