Detecting salient parts of text with natural language processing is widely used to mitigate the effects of information overload. Nevertheless, most of the datasets available for this task are derived mainly from academic publications. We introduce SPACE-IDEAS, a dataset for salient information detection in innovation ideas related to the Space domain. The text in SPACE-IDEAS varies greatly and includes informal, technical, academic, and business-oriented writing styles. In addition to a manually annotated dataset, we release an extended version annotated using a large generative language model. We train different sentence classifiers and sequential sentence classifiers, and show that the automatically annotated dataset can be leveraged through multitask learning to train better classifiers.