In this paper, we introduce a multi-label lazy learning approach to automatic semantic indexing of large document collections whose label vocabularies are complex, structured, and highly inter-correlated. The proposed method extends the traditional k-Nearest Neighbors algorithm with a large autoencoder trained to map the large label space to a reduced-size latent space and to regenerate the predicted labels from this latent space. We evaluated our proposal on a large portion of the MEDLINE biomedical document collection, which uses the Medical Subject Headings (MeSH) thesaurus as a controlled vocabulary. In our experiments, we propose and evaluate several document representation approaches and different label autoencoder configurations.
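To make the general idea concrete, the following is a minimal sketch of k-Nearest Neighbors prediction carried out in a compressed label space. It is not the implementation described in the paper. Several parts are assumptions made for illustration: the neural label autoencoder is replaced by a linear stand-in built from a truncated SVD of the label matrix; neighbors are ranked by cosine similarity; the neighbors' latent codes are simply averaged; and decoded scores are binarized at a threshold of 0.5. The function names `fit_linear_label_autoencoder` and `knn_predict_labels` are hypothetical.

```python
import numpy as np


def fit_linear_label_autoencoder(Y, latent_dim):
    """Linear stand-in (assumption) for a trained label autoencoder.

    Y is a binary (n_docs, n_labels) matrix. Returns encode/decode functions
    that map label vectors to and from a latent_dim-dimensional space.
    """
    mean = Y.mean(axis=0)
    # Rows of Vt are orthonormal directions in label space,
    # ordered by how much label variance they capture.
    _, _, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    W = Vt[:latent_dim]  # shape: (latent_dim, n_labels)

    def encode(labels):
        return (labels - mean) @ W.T

    def decode(codes):
        return codes @ W + mean

    return encode, decode


def knn_predict_labels(X_train, Y_train, x_query, k, encode, decode, threshold=0.5):
    """Predict a binary label vector for one query document."""
    # Cosine similarity between the query and every training document.
    X_norm = X_train / np.linalg.norm(X_train, axis=1, keepdims=True)
    q_norm = x_query / np.linalg.norm(x_query)
    similarities = X_norm @ q_norm

    # Indices of the k most similar training documents.
    neighbors = np.argsort(-similarities)[:k]

    # Average the neighbors' label codes in latent space (assumed aggregation),
    # then regenerate label scores from that latent point.
    latent = encode(Y_train[neighbors]).mean(axis=0)
    scores = decode(latent)

    # Assumed decision rule: keep labels whose decoded score reaches the threshold.
    return (scores >= threshold).astype(int)
```

In this sketch the document vectors in `X_train` can come from any representation (for example, sparse term weights or dense embeddings), since only pairwise similarity is needed. A nonlinear autoencoder would replace `encode` and `decode` without changing the neighbor search.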