The Long Tail, Not the Front Page: Cold-Start Prediction of Crowd Highlight Salience

A social highlighter's most useful signal -- which passages a crowd of readers marks -- exists only for documents people have already read. Can the aggregate crowd salience of a document be predicted from its text before its marks accumulate? Prior work on this data found that zero-shot language models recover highlight locations worse than a trivial lead (position) baseline, so we ask whether a model trained on the highlight corpus can beat that baseline. Using a pre-registered ladder of models and a by-document cluster bootstrap, we find a small but robust edge: a logistic ranker over sentence embeddings and positional/contextual features beats the lead baseline by +0.044 average precision (95% CI [+0.029, +0.058]; clears a pre-registered margin delta=0.03 in 97% of resamples, and stable across pipeline re-runs). Two unsupervised extractive baselines (centroid, LexRank-style centrality) lose to lead, and the trained model beats them by +0.108, so the edge is not recovered by generic unsupervised proxies -- it reflects learning from real reader marks. In product terms, precision@3 rises from 0.25 to 0.39 (+55% relative) and the model beats lead on 69% of documents. An ablation attributes the edge to the raw embedding (+0.014) and training augmentation (+0.010), each with a positive CI. The edge is not a temporal-generalization failure, and we find no evidence that content drift or near-duplicate leakage explains it. A standardized regression shows the advantage is governed mainly by document popularity (lower popularity, larger edge) and by label reliability. It nearly vanishes only on the most popular content; there it is the lead baseline that strengthens, not the model that weakens. Because our evaluation conditions on documents that eventually accumulated readers, these results are a retrospective cold-start simulation.
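The headline comparison (model versus lead baseline, with a by-document cluster bootstrap and a margin check) can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes that average precision is computed per document and then averaged over documents, that the interval is a percentile interval, and that the lead baseline scores sentences by position alone. The function names `average_precision`, `lead_scores`, and `cluster_bootstrap` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AP for one document: mean of precision@k over the ranks k that hold a positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def lead_scores(n_sentences: int) -> List[float]:
    """Lead (position) baseline: earlier sentences receive higher scores."""
    return [float(n_sentences - i) for i in range(n_sentences)]


def cluster_bootstrap(
    deltas: Sequence[float], margin: float, n_boot: int = 2000, seed: int = 0
) -> Tuple[float, float, float, float]:
    """Resample whole documents (one AP delta each) with replacement.

    Returns (mean delta, 2.5th percentile, 97.5th percentile,
    fraction of resamples whose mean delta exceeds `margin`).
    Assumption: one delta per document, so resampling deltas is
    equivalent to resampling documents as clusters.
    """
    rng = random.Random(seed)
    n = len(deltas)
    means = []
    for _ in range(n_boot):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int(0.025 * n_boot)]
    hi = means[min(n_boot - 1, int(0.975 * n_boot))]
    frac_above = sum(m > margin for m in means) / n_boot
    return sum(deltas) / n, lo, hi, frac_above
```

In use, one would compute `average_precision(model_scores, labels) - average_precision(lead_scores(len(labels)), labels)` for each held-out document and pass the resulting list to `cluster_bootstrap` with `margin=0.03`, the pre-registered delta quoted above. Resampling at the document level keeps all sentences of a document together, so within-document correlation does not understate the interval width.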

