The proliferation of data and text documents on the Internet, such as articles, web pages, books, and social network posts, has created a fundamental challenge across text processing known as "automatic text summarisation". Manually processing and summarising large volumes of textual data is difficult, expensive, time-consuming, and in practice infeasible for human users. Text summarisation systems are divided into extractive and abstractive categories. In extractive summarisation, the final summary of a text document is assembled from the important sentences of that document without any modification. This method can produce repeated sentences and pronouns whose referents are missing from the summary. In abstractive summarisation, by contrast, the final summary is generated from the meaning and significance of the sentences and words of the same document or of other documents. Many existing works use either extractive or abstractive methods to summarise collections of web documents; each has advantages and disadvantages in terms of the similarity or size of the resulting summaries. In this work, a crawler is developed to extract popular text posts from the Instagram social network with appropriate preprocessing, and a set of extractive and abstractive algorithms is combined to show how each of the abstractive algorithms can be used. Observations on 820 popular text posts from Instagram indicate that the proposed system reaches an accuracy of 80%.
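To make the extractive idea concrete, the following is a minimal sketch in Python of a word-frequency sentence scorer. It is not the system described above: the scoring rule, the tiny stopword list, and the names `split_sentences`, `tokenize`, and `extractive_summary` are all assumptions chosen for illustration. It only shows the defining property of extractive summarisation, namely that the selected sentences are returned unmodified.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "on", "for"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Lowercase word tokens with stopwords removed."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]


def extractive_summary(text, k=2):
    """Return the k highest-scoring sentences, unmodified, in document order."""
    sentences = split_sentences(text)
    # Word frequencies over the whole document.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        # Average document frequency of the sentence's words.
        tokens = tokenize(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]),
                    reverse=True)
    chosen = sorted(ranked[:k])  # restore original document order
    return [sentences[i] for i in chosen]
```

Because sentences are copied verbatim, a selected sentence that begins with "It" or "They" may lose its referent, which is the pronoun problem noted in the abstract. An abstractive method would instead generate new wording.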