This paper highlights the challenges, current trends, and open issues in representing, querying, and analysing content extracted from text. The internet holds vast amounts of textual information on many subjects, including commercial documents, medical records, scientific experiments, engineering tests, and events that affect urban and natural environments. Extracting knowledge from this text requires understanding the nuances of natural language and representing the content accurately, without losing information, so that knowledge can be accessed, inferred, or discovered. Achieving this requires combining results from several fields: linguistics, natural language processing, knowledge representation, data storage, querying, and analytics. The vision of this paper is that graphs can be a well-suited representation of text content once annotated and combined with the right querying and analytics techniques. The paper discusses this hypothesis from the perspectives of linguistics, natural language processing, graph models and databases, and artificial intelligence, as presented by the panellists of the DOING session at the MADICS Symposium 2022.