Contributions to the Improvement of Question Answering Systems in the Biomedical Domain

This thesis addresses question answering (QA) in the biomedical domain, which poses several specific challenges, such as specialized lexicons and terminologies, the types of questions handled, and the characteristics of the target documents. QA aims to provide inquirers with direct, short, and precise answers to their natural language questions. We are particularly interested in studying and improving methods for finding accurate, short answers to biomedical natural language questions in a large collection of English-language biomedical text documents. In this Ph.D. thesis, we propose four contributions to improve the performance of QA in the biomedical domain. In the first contribution, we propose a machine learning-based method for question type classification, which determines the type of a given question and thereby enables a biomedical QA system to apply the appropriate answer extraction method. We also propose another machine learning-based method that assigns one or more topics (e.g., pharmacological, test, treatment) to a given question in order to determine the semantic types of the expected answers, which is very useful for generating specific answer retrieval strategies. In the second contribution, we first propose a document retrieval method that retrieves from the MEDLINE database a set of relevant documents likely to contain the answers to biomedical questions. We then present a passage retrieval method that retrieves a set of passages relevant to the questions. In the third contribution, we propose specific answer extraction methods that generate both exact and ideal answers. Finally, in the fourth contribution, we develop a fully automated semantic biomedical QA system called SemBioNLQA, which can handle a variety of natural language questions and generate appropriate answers by providing both exact and ideal answers.
