Automated Extraction of Pharmacokinetic Parameters from Structured XML Scientific Articles: Enhancing Data Accessibility at Scale

Pharmacology lacks centralized, comprehensive, and up-to-date repositories of pharmacokinetic (PK) data. This is a significant obstacle for research and development, because collecting the required quantitative PK parameters from diverse scientific publications is time-consuming and difficult. Quantitative PK information is predominantly presented in tables, which are mostly available as XML, HTML, or PDF files in online repositories and scientific publications, including supplementary materials. Tables are therefore among the most important information elements of scientific and regulatory documents. Extracting data from tables by hand is labor-intensive, while automated machine learning models may struggle to detect and extract the relevant data accurately because table layouts are complex and varied. The difficulty of information extraction and reading-order detection depends largely on the structural complexity of the table. Efforts to understand tables should therefore prioritize capturing cell content in a way that matches how a human reader naturally comprehends it. The Food Animal Residue Avoidance Databank (FARAD) has manually extracted tabular data and other information from the literature and from regulatory agencies for over 40 years. However, the large volume of publications released daily now makes automating this process urgent. Maintaining accuracy has become increasingly difficult, because manual extraction is tedious and error-prone, especially given current staffing shortages. This calls for AI algorithms for table detection and extraction that can precisely handle cells according to the table structure, as indicated by column and/or row header information.
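
To make the idea of handling cells "according to the table structure" concrete, the following is a minimal sketch, not the method described in this work. It assumes an XML table with a single header row inside `thead` and data rows inside `tbody`; the sample fragment, its values, and the function names `cell_text` and `extract_records` are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical table fragment; the tag layout and the values are illustrative only.
SAMPLE_XML = """
<table>
  <thead>
    <tr><th>Parameter</th><th>Unit</th><th>Value</th></tr>
  </thead>
  <tbody>
    <tr><td>Cmax</td><td>ug/mL</td><td>1.2</td></tr>
    <tr><td>Tmax</td><td>h</td><td>4</td></tr>
  </tbody>
</table>
"""


def cell_text(cell):
    """Return the cell's text, including nested inline markup, with whitespace collapsed."""
    return " ".join("".join(cell.itertext()).split())


def extract_records(xml_string):
    """Pair every body cell with the column header at the same position."""
    root = ET.fromstring(xml_string.strip())
    header_row = root.find("./thead/tr")
    headers = [cell_text(cell) for cell in header_row]
    records = []
    for row in root.findall("./tbody/tr"):
        values = [cell_text(cell) for cell in row]
        # zip() stops at the shorter sequence, so ragged rows are truncated silently.
        records.append(dict(zip(headers, values)))
    return records


if __name__ == "__main__":
    for record in extract_records(SAMPLE_XML):
        print(record)
```

This sketch deliberately ignores merged cells (`colspan`/`rowspan`), multi-row headers, and row headers, which are the sources of structural complexity the paragraph above refers to; a real extractor would need to resolve them before pairing cells with headers.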
