Personal Salience: Highlighting Is Social, but Individuality Lives in Selection

Social highlighters let people mark passages that matter to them. We ask how much of an individual is recoverable from these naturalistic traces, using a co-readership identity control (the same document highlighted by many users) that holds document and topic fixed and asks whether a person's own history predicts their marks better than another reader's does. We separate generic salience (structure), crowd salience (what others marked), and personal salience (the individual residual). First, highlighting is social: which sentences you mark is predicted far better by the crowd than by structure or by a personal model. A well-estimated crowd, an information-privileged baseline that sees others' marks on the same document, even beats a frontier LLM twin built from your other-document history, and the within-document personal signal is at most a whisper (own-vs-other gap +0.017 by an embedding scorer, small but significant). Second, in sharp contrast, individuality lives in selection: asked which of the already-salient passages are yours, your own history is a strong, leakage-free predictor (gap +0.14). A topic decomposition shows that this is largely stable thematic preference: the gap shrinks ~6-8x against a topically matched peer, and the thin residual cannot be separated from finer-grained topic. The non-obvious part is an asymmetry: under the same scorer, the individual signal is ~6-8x weaker in salience than in selection. Methodologically, naive history-conditioning evaluations leak (the target's own marks enter the profile in ~42% of pairs, inflating personal scores by up to +0.15 AP), and small crowds overstate personalization; our results are leakage-free and use a dense crowd and a model-matched control. Highlights carry a genuine individual signature, but it is a thin layer over a strong shared one, and it surfaces far more in which salient things a person selects than in what is salient.
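
As a rough illustration of the own-vs-other gap and the leakage control described above, the following minimal sketch scores a document's sentences against a reader's own history and against another reader's history, then compares average precision (AP). It is not the paper's pipeline: the bag-of-words cosine scorer stands in for the embedding scorer, and the names `build_profile`, `own_vs_other_gap`, and the `(doc_id, highlighted_text)` history format are assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def bow(text):
    """Bag-of-words vector; an assumed stand-in for an embedding scorer."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profile(history, target_doc_id):
    """Aggregate a reader's highlights, skipping the target document.

    history: list of (doc_id, highlighted_text) pairs (hypothetical format).
    Dropping the target document keeps the reader's own marks on it
    out of the profile, which is the leakage the abstract warns about.
    """
    profile = Counter()
    for doc_id, text in history:
        if doc_id != target_doc_id:
            profile.update(bow(text))
    return profile


def average_precision(scores, labels):
    """AP of a ranking: mean precision at each positive's rank."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def own_vs_other_gap(sentences, labels, own_history, other_history, target_doc_id):
    """AP using the reader's own profile minus AP using another reader's."""
    own = build_profile(own_history, target_doc_id)
    other = build_profile(other_history, target_doc_id)
    ap_own = average_precision([cosine(bow(s), own) for s in sentences], labels)
    ap_other = average_precision([cosine(bow(s), other) for s in sentences], labels)
    return ap_own - ap_other
```

In this sketch, `labels` marks which sentences the target reader highlighted in the held-out document; a positive gap means the reader's own history ranks their marks better than the other reader's history does. Restricting `sentences` to passages that some reader already marked would turn the same function into a selection-style comparison rather than a salience-style one.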
