The predictive probability of the next token (P_token) in large language models (LLMs) is inextricably linked to the relevance probability of the next piece of information, the purchase probability of the next product, and the execution probability of the next action, all of which fall under the task-level target distribution (P_task). While LLMs are known to generate samples that approximate real-world distributions, whether their fine-grained sampling probabilities faithfully align with task requirements remains an open question. Through controlled distribution-sampling simulations, we uncover a striking dichotomy in LLM behavior that distinguishes two model types: D-models (e.g., Qwen-2.5), whose P_token exhibits large step-to-step variability and poor alignment with P_task; and E-models (e.g., Mistral-Small), whose P_token is more stable and better aligned with P_task. We further evaluate both model types on downstream tasks such as code generation and recommendation, revealing systematic trade-offs between diversity and stability that shape task outcomes. Finally, we analyze the internal properties of both model families to probe the mechanisms underlying this behavior. These findings offer foundational insights into the probabilistic sampling behavior of LLMs and provide practical guidance on when to favor D-models over E-models, and vice versa. For web-scale applications, including recommendation, search, and conversational agents, our results inform model selection and configuration to balance diversity with reliability under real-world uncertainty, and they make the sampling behavior of these models easier to interpret.
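To make the two quantities in the abstract concrete, the following minimal sketch shows one way that "alignment with P_task" and "step-to-step variability" could be measured. It is an illustration under stated assumptions, not the paper's procedure: the choice of total variation distance, the function names, and the toy per-step distributions are all hypothetical, and no model is queried.

```python
def total_variation(p, q):
    """Total variation distance between two distributions given as dicts."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in outcomes)


def alignment_and_variability(step_probs, p_task):
    """Summarize a sequence of per-step distributions (assumed metric choice).

    step_probs: list of dicts, one P_token-style distribution per sampling step.
    p_task:     dict, the task-level target distribution.
    Returns (mean distance to p_task, mean distance between consecutive steps).
    """
    distances = [total_variation(p, p_task) for p in step_probs]
    jumps = [total_variation(a, b) for a, b in zip(step_probs, step_probs[1:])]
    mean_distance = sum(distances) / len(distances)
    mean_jump = sum(jumps) / len(jumps) if jumps else 0.0
    return mean_distance, mean_jump


# Hypothetical target: outcome "A" should be chosen 70% of the time.
p_task = {"A": 0.7, "B": 0.3}

# Toy "stable" sequence: every step reproduces the target distribution.
stable_steps = [{"A": 0.7, "B": 0.3} for _ in range(10)]

# Toy "volatile" sequence: each step is near-deterministic, yet the
# outcomes still occur in a 7:3 ratio across the ten steps.
pattern = ["A", "A", "B", "A", "A", "B", "A", "A", "B", "A"]
volatile_steps = [
    {"A": 1.0, "B": 0.0} if c == "A" else {"A": 0.0, "B": 1.0} for c in pattern
]

stable_result = alignment_and_variability(stable_steps, p_task)
volatile_result = alignment_and_variability(volatile_steps, p_task)

if __name__ == "__main__":
    print("stable:  ", stable_result)
    print("volatile:", volatile_result)
```

In this toy setup both sequences yield the same aggregate 7:3 outcome frequency, but only the stable one matches the target at every step. This is the kind of gap between sample-level and probability-level alignment that the abstract raises as an open question.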