TaeC: a manually annotated text dataset for trait and phenotype extraction and entity linking in wheat breeding literature

Wheat varieties show a large diversity of traits and phenotypes. Linking them to genetic variability is essential for shorter and more efficient wheat breeding programs. Newly desirable traits in wheat varieties include disease resistance to reduce pesticide use, adaptation to climate change, resistance to heat and drought stresses, and low gluten content in grains. Wheat breeding experiments are documented by a large body of scientific literature and by observational data obtained in the field and under controlled conditions. Cross-referencing complementary information from the literature and from observational data is essential to the study of the genotype-phenotype relationship and to the improvement of wheat selection. The scientific literature on genetic marker-assisted selection contains a wealth of information about the genotype-phenotype relationship. However, the variety of expressions used to refer to traits and phenotype values in scientific articles hinders finding this information and cross-referencing it. When adequately trained on annotated examples, recent text mining methods perform well in named entity recognition and linking in the scientific domain. While several corpora contain annotations of human and animal phenotypes, no corpus is currently available for training and evaluating named entity recognition and entity-linking methods on plant phenotype literature. The Triticum aestivum trait Corpus (TaeC) is a new gold standard for traits and phenotypes of wheat. It consists of 540 PubMed references fully annotated for trait, phenotype, and species named entities using the Wheat Trait and Phenotype Ontology and the species taxonomy of the National Center for Biotechnology Information. A study of the performance of tools trained on the Triticum aestivum trait Corpus shows that the corpus is suitable for the training and evaluation of named entity recognition and linking.
