Detecting and tracking emerging trends and weak signals in large, evolving text corpora is vital for applications such as monitoring scientific literature, managing brand reputation, and surveilling critical infrastructure, and more generally for any kind of text-based event detection. Existing solutions often fail to capture nuanced context or to track evolving patterns dynamically over time. BERTrend, a novel method, addresses these limitations by applying neural topic modeling in an online setting. It introduces a new metric that quantifies topic popularity over time by considering both the number of documents in a topic and how frequently the topic is updated. Based on this metric, topics are classified as noise, weak signals, or strong signals, and emerging, rapidly growing topics are flagged for further investigation. Experiments on two large real-world datasets demonstrate BERTrend's ability to accurately detect and track meaningful weak signals while filtering out noise, offering a comprehensive solution for monitoring emerging trends in large-scale, evolving text corpora. The method can also be used for retrospective analysis of past events. In addition, combining Large Language Models with BERTrend offers an efficient means of interpreting event trends.
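To make the popularity metric more concrete, the following is a minimal Python sketch of one way such a metric could behave. It is not BERTrend's actual formula, which the abstract does not specify. The sketch assumes that a topic's popularity grows by the number of new documents whenever the topic is updated in a time slice, and decays exponentially with the time elapsed since its last update otherwise. The function names, the `decay` parameter, and the two classification cutoffs are all hypothetical choices made for illustration.

```python
import math


def update_popularity(popularity, last_update, doc_counts, t, decay=0.01):
    """Return updated (popularity, last_update) dicts after time slice t.

    Assumed rule (not taken from the paper): a topic that receives new
    documents gains that many popularity points; a topic that receives
    none is shrunk by exp(-decay * gap**2), where gap is the number of
    slices since its last update.
    """
    new_pop = dict(popularity)
    new_last = dict(last_update)

    # Topics updated in this slice: add document counts, reset the clock.
    for topic, n_docs in doc_counts.items():
        if n_docs > 0:
            new_pop[topic] = new_pop.get(topic, 0.0) + n_docs
            new_last[topic] = t

    # Topics not updated in this slice: apply the assumed decay.
    for topic in new_pop:
        if doc_counts.get(topic, 0) == 0:
            gap = t - new_last.get(topic, t)
            new_pop[topic] *= math.exp(-decay * gap ** 2)

    return new_pop, new_last


def classify(value, noise_cutoff, strong_cutoff):
    """Label a popularity value using two hypothetical cutoffs."""
    if value < noise_cutoff:
        return "noise"
    if value < strong_cutoff:
        return "weak"
    return "strong"


# Usage: topic "a" appears in slice 0, then goes quiet while "b" appears.
pop, last = update_popularity({}, {}, {"a": 5}, t=0)
pop, last = update_popularity(pop, last, {"b": 2}, t=1)
labels = {topic: classify(v, noise_cutoff=1.0, strong_cutoff=4.5)
          for topic, v in pop.items()}
```

In practice the cutoffs would more plausibly be derived from the distribution of popularity values over a recent window than fixed by hand, and a weak signal would only be flagged for investigation when its popularity is also rising. Both refinements are left out here to keep the sketch short.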