Social highlighters let people mark the passages that matter to them. We ask how much of an individual is recoverable from these naturalistic traces, using a co-readership identity control (the same document highlighted by many users) that holds document and topic fixed and asks whether a person's own history predicts their marks better than another reader's does. We separate generic salience (structure), crowd salience (what others marked), and personal salience (the individual residual). First, highlighting is social: which sentences a reader marks is predicted far better by the crowd than by structure or by a personal model. A well-estimated crowd, an information-privileged baseline that sees others' marks on the same document, even beats a frontier LLM twin built from the reader's history on other documents, and the within-document personal signal is at most a whisper (own-vs-other gap of +0.017 under an embedding scorer, small but significant). Second, in sharp contrast, individuality lives in selection: when the task is to identify which of the already-salient passages a reader chose, that reader's own history is a strong, leakage-free predictor (gap +0.14). A topic decomposition shows this is largely stable thematic preference: the gap shrinks ~6-8x against a topically matched peer, and the thin residual that remains cannot be separated from finer-grained topic. The non-obvious part is an asymmetry: under the same scorer, the individual signal is ~6-8x weaker in salience than in selection. Methodologically, naive history-conditioning evaluations leak (the target's own marks enter the profile in ~42% of pairs, inflating personal scores by up to +0.15 AP), and small crowds overstate personalization; our results are leakage-free and use a dense crowd and a model-matched control. Highlights carry a genuine individual signature, but it is a thin layer over a strong shared one, surfacing far more in which salient things a person selects than in what is salient.
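The own-vs-other gap reported above can be illustrated with a minimal sketch. It is not the authors' pipeline: the mean-embedding profile, the cosine scorer, and the names `profile`, `average_precision`, and `own_vs_other_gap` are all assumptions made for illustration. The one idea taken from the text is that the profile must exclude the target document's own marks, since letting them in is the leakage the abstract describes.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def profile(history, exclude_doc):
    """Mean embedding of a reader's past highlights.

    `history` is a list of (doc_id, embedding) pairs. Highlights from
    `exclude_doc` are dropped so the target's own marks cannot leak in
    (an assumed, simple way to keep the evaluation leakage-free).
    """
    vecs = [vec for doc_id, vec in history if doc_id != exclude_doc]
    if not vecs:
        raise ValueError("no history left after excluding the target document")
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def average_precision(scores, labels):
    """Average precision of a ranking induced by `scores` against binary `labels`."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def own_vs_other_gap(sentences, labels, own_history, other_history, doc_id):
    """AP when scoring with the reader's own profile minus AP with another reader's."""
    own = profile(own_history, doc_id)
    other = profile(other_history, doc_id)
    ap_own = average_precision([cosine(s, own) for s in sentences], labels)
    ap_other = average_precision([cosine(s, other) for s in sentences], labels)
    return ap_own - ap_other
```

A positive gap means the reader's own history ranks their marked sentences higher than another reader's history does on the same document. Averaging this quantity over many reader pairs on co-read documents would give a figure comparable in spirit to the gaps quoted above, under the stated assumptions.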