Automated narrative intelligence systems for social media monitoring face significant scalability challenges when they process continuous data streams with traditional batch clustering algorithms. We investigate replacing HDBSCAN, an offline clustering algorithm, with online (streaming or incremental) clustering methods in a production pipeline for narrative report generation. The proposed system employs a three-stage architecture (data collection, modeling, dashboard generation) and processes thousands of multilingual social media documents daily. While HDBSCAN excels at discovering hierarchical density-based clusters and handling noise, its batch-only nature requires complete retraining for each time window, which leads to memory constraints, computational inefficiency, and an inability to adapt to evolving narratives in real time. This work evaluates several online clustering algorithms along four dimensions: cluster quality preservation, computational efficiency, memory footprint, and integration compatibility with existing workflows. We propose evaluation criteria that balance traditional clustering metrics (Silhouette Coefficient, Davies-Bouldin Index) with narrative-level metrics (narrative distinctness, continuity, and variance). Our methodology includes sliding-window simulations on historical datasets from the Ukrainian information space, enabling comparative analysis of algorithmic trade-offs in realistic operational contexts. This research addresses a critical gap between batch-oriented topic modeling frameworks and the streaming nature of social media monitoring, with implications for computational social science, crisis informatics, and narrative surveillance systems.