The present study proposes a novel method of trend detection and visualization: specifically, modeling the change in a topic over time. Where current models used for the identification and visualization of trends only convey the popularity of a single word based on stochastic counting of its usage, the approach in the present study illustrates both the popularity of a topic and the direction in which it is moving. The direction in this case is a distinct subtopic within the selected corpus. Such trends are generated by modeling the movement of a topic, using k-means clustering and cosine similarity to group the distances between clusters over time. In a convergent scenario, it can be inferred that the topics as a whole are meshing (tokens between topics are becoming interchangeable). Conversely, a divergent scenario would imply that each topic's respective tokens are not found in the same context (the words are increasingly different from each other). The methodology was tested on a group of articles from various media houses present in the 20 Newsgroups dataset.
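The convergent and divergent scenarios described above can be illustrated with a minimal sketch. It is not the study's implementation: the abstract does not say how tokens are vectorized, how many clusters are used, or how the trend is summarized, so the toy 2-D token vectors, the choice of k = 2, the plain Lloyd's k-means with a deterministic start, and the first-versus-last comparison in `classify_trend` are all assumptions made for illustration. All function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means. Assumption: the first k points seed the
    centroids, which keeps this toy example deterministic."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


def centroid_similarity_over_time(slices, k=2):
    """For each time slice, cluster the token vectors and return the mean
    pairwise cosine similarity between the cluster centroids."""
    series = []
    for points in slices:
        centroids = kmeans(points, k)
        pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
        total = sum(cosine_similarity(centroids[i], centroids[j])
                    for i, j in pairs)
        series.append(total / len(pairs))
    return series


def classify_trend(series, tol=1e-9):
    """Assumed summary rule: compare the last slice with the first."""
    change = series[-1] - series[0]
    if change > tol:
        return "convergent"  # clusters are moving toward each other
    if change < -tol:
        return "divergent"   # clusters are moving apart
    return "stable"


# Hypothetical token vectors for three consecutive time slices.
slices = [
    [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]],
    [[1.0, 0.4], [0.4, 1.0], [0.9, 0.5], [0.5, 0.9]],
    [[1.0, 0.8], [0.8, 1.0], [0.95, 0.85], [0.85, 0.95]],
]

series = centroid_similarity_over_time(slices)
print(series, classify_trend(series))
```

In this toy data the two clusters drift toward each other, so the similarity series rises from slice to slice; feeding the slices in reverse order yields a falling series. A real pipeline would replace the hand-written vectors with embeddings or TF-IDF vectors derived from the corpus.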