Text extraction is a highly subjective problem: it depends on the dataset at hand and on the kind of summary information that needs to be extracted. Every step, from data preprocessing to the choice of an optimal prediction model, depends on the problem and the corpus. In this paper, we describe a text extraction model that extracts word-specific semantic information, so that all related and meaningful information about a given word is returned in a succinct format. The model obtains meaningful results and can augment a ubiquitous search model or standard clustering and topic modelling algorithms. By combining a new technique, which we call two cluster assignment, with a K-means model, we improve the ontology of the retrieved text. We further apply a vector average damping technique to allow flexible movement of clusters. Our experimental results on a recent COVID-19 corpus show that the model obtains good results for the main keywords.
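The abstract names two modifications to K-means but does not define them, so the following is only a minimal sketch of one possible reading, not the authors' algorithm. It assumes that "two cluster assignment" means each vector contributes to its two nearest clusters, and that "vector average damping" means each centroid moves only part of the way toward the mean of its members. The function name `damped_two_cluster_kmeans` and the parameters `damping`, `n_iter`, and `seed` are hypothetical.

```python
import numpy as np


def damped_two_cluster_kmeans(X, k, damping=0.5, n_iter=50, seed=0):
    """Hypothetical K-means variant (an illustration, not the paper's method).

    X is an (n, d) array of embedding vectors and k >= 2 is the cluster count.
    Returns the centroids and, for each point, its two nearest cluster indices
    as computed in the final iteration.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    for _ in range(n_iter):
        # Euclidean distance from every point to every centroid: shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

        # Assumption: each point is assigned to its two nearest clusters.
        top2 = np.argsort(dists, axis=1)[:, :2]

        for j in range(k):
            members = X[(top2 == j).any(axis=1)]
            if len(members) == 0:
                continue  # leave an empty cluster where it is
            # Assumption: damping blends the old centroid with the member mean
            # instead of jumping straight to the mean.
            centroids[j] = (1 - damping) * centroids[j] + damping * members.mean(axis=0)

    return centroids, top2
```

With `damping=1.0` the update reduces to the usual mean step of K-means (still with the two-nearest assignment), while smaller values make the clusters move more gradually between iterations.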