In this paper, we propose Precision-Informed Semantic Modeling (PRISM), a structured topic modeling framework that combines the rich representations captured by LLMs with the low cost and interpretability of latent semantic clustering methods. PRISM fine-tunes a sentence encoding model using a sparse set of LLM-provided labels on samples drawn from a corpus of interest. We then segment the resulting embedding space with thresholded clustering, yielding clusters that separate closely related topics within a narrow domain. Across multiple corpora, PRISM improves topic separability over state-of-the-art local topic models, and even over clustering on large, frontier embedding models, while requiring only a small number of LLM queries to train. This work contributes to several research streams by providing (i) a student-teacher pipeline that distills sparse LLM supervision into a lightweight model for topic discovery; (ii) an analysis of how effectively different sampling strategies improve local geometry for cluster separability; and (iii) an effective approach to web-scale text analysis, enabling researchers and practitioners to track nuanced claims and subtopics online with an interpretable, locally deployable framework.
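To make the thresholded clustering step concrete, the following is a minimal sketch and not the PRISM implementation. It assumes that clusters are the connected components of a graph linking any two embeddings whose cosine similarity meets a threshold (a single-linkage rule); the abstract does not specify the clustering algorithm, so this choice is an assumption. The function names `cosine` and `threshold_clusters`, the toy 2-D vectors, and the threshold value are all hypothetical, and the vectors stand in for the outputs of a fine-tuned sentence encoder.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def threshold_clusters(embeddings: List[Sequence[float]], threshold: float) -> List[int]:
    """Assign a cluster id to each embedding.

    Assumption (not from the paper): two items are linked when their cosine
    similarity is >= threshold, and clusters are the connected components
    of the resulting graph.
    """
    parent = list(range(len(embeddings)))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    # Relabel roots as consecutive ids in order of first appearance.
    ids = {}
    labels = []
    for i in range(len(embeddings)):
        root = find(i)
        labels.append(ids.setdefault(root, len(ids)))
    return labels


# Hypothetical toy vectors standing in for sentence-encoder outputs.
toy = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
labels = threshold_clusters(toy, threshold=0.9)
```

In this sketch, raising the threshold splits the data into finer clusters and lowering it merges them, which is the knob that would control how aggressively closely related topics are separated. The pairwise loop is quadratic in the number of items, so it only illustrates the idea and would not be suitable for web-scale corpora.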