Our paper studies the predictability of online speech -- that is, how well language models learn to model the distribution of user-generated content on X (previously Twitter). We define predictability as a measure of the model's uncertainty, i.e., its negative log-likelihood. As the basis of our study, we collect 10M tweets for ``tweet-tuning'' base models and a further 6.25M posts from more than five thousand X users and their peers. Across these more than 5000 subjects, we find that predicting the posts of individual users remains surprisingly hard. Moreover, the choice of context matters greatly: models conditioned on a user's own history significantly outperform models conditioned on posts from their social circle. We validate these results across four large language models ranging in size from 1.5 billion to 70 billion parameters, and the results replicate if, instead of prompting the model with additional context, we finetune on it. We follow up with a detailed investigation of what is learned in-context and a demographic analysis. Up to 20\% of what is learned in-context is the use of @-mentions and hashtags. Our main results hold across the demographic groups we studied.
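The predictability measure above can be made concrete as follows (a minimal formalization, assuming per-token averaging over a post of length $T$; the exact normalization used in the paper is not stated here). Given a post $w_{1:T}$ and conditioning context $c$ (e.g., the user's own history or posts from their peers), a model with parameters $\theta$ is scored by

\[
\mathrm{NLL}(w_{1:T} \mid c) \;=\; -\frac{1}{T} \sum_{t=1}^{T} \log p_{\theta}\!\left(w_t \mid w_{<t},\, c\right),
\]

where lower values indicate that the post is more predictable under the model.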