Lifelong learning has recently attracted attention as a way to build machine learning systems that continually accumulate and transfer knowledge to help future learning. Unsupervised topic modeling is widely used to discover topics from document collections. However, topic modeling is challenging under data sparsity, e.g., in a small collection of (short) documents, where it tends to generate incoherent topics and sub-optimal document representations. To address the problem, we propose a lifelong learning framework for neural topic modeling that can continuously process streams of document collections, accumulate topics, and guide future topic modeling tasks by transferring knowledge from several sources to better deal with sparse data. In the lifelong process, we jointly investigate: (1) sharing generative homologies (latent topics) over the lifetime to transfer prior knowledge, and (2) minimizing catastrophic forgetting to retain past learning via novel selective data augmentation, co-training, and topic regularization approaches. Given a stream of document collections, we apply the proposed Lifelong Neural Topic Modeling (LNTM) framework to model three sparse document collections as future tasks and demonstrate improved performance, quantified by perplexity, topic coherence, and an information retrieval task.