Extraction of tabulated statistical results with tableParser

Tabulated content is omnipresent in scientific literature. This work presents the R package *tableParser*, designed to extract and postprocess tables from NISO-JATS-encoded XML, HTML, DOCX, and, with limitations, PDF documents. *tableParser* focuses on extracting and analyzing statistical test results reported in scientific publications. It can be used for large-scale analyses of effect sizes, reporting practices, or summarization of results, as well as for checking the completeness and consistency of standard test results in unpublished documents. Documents can be processed at three decoding levels. *table2matrix()* compiles all tables into a list of character matrices with captions and footnotes. *table2text()* collapses the matrix contents into human-readable text, mimicking a screen reader. Optionally, many common codings reported in a table's caption and footnotes can be used to decode and expand the table's content. The collapsed and decoded table content can be further processed to match an ideal input for the extraction of standard statistical results with the *standardStats()* function from the *JATSdecoder* package. The output of the third level, *table2stats()*, is a data frame with all detected standard results as columns and, where calculation is possible, a recalculated p-value. If desired, an automated consistency check of the reported and the coded p-values against the recalculated p-value can be initiated. *tableParser* works best on barrier-free HTML tables encoded in NISO-JATS, where captions and footnotes are clearly identifiable. Because table captions and footnotes are guessed conservatively, the processing of tables within HTML and DOCX documents is comparably robust. Technically, tables in PDFs often fail to be extracted correctly, and their captions and footnotes are not detectable. The codes therefore cannot be decoded, which lowers *tableParser*'s decoding accuracy on PDFs.
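
The two ideas at the core of this pipeline, collapsing a character matrix into readable text and checking a reported p-value against a recalculated one, can be illustrated with a minimal sketch. This is not *tableParser*'s implementation or API: it is written in Python rather than R, the function names `collapse_table`, `recalc_p_from_z`, and `p_is_consistent` are hypothetical, and it assumes a simple table whose first row holds the column headers and whose first column holds the row labels. The recalculation covers only a two-sided z-test, chosen because it needs nothing beyond the standard library.

```python
import math


def collapse_table(matrix):
    """Collapse a character matrix into one line of text per body row.

    Assumes (for illustration only) that the first row holds column headers
    and the first column holds row labels. Empty cells are skipped.
    """
    header, *rows = matrix
    lines = []
    for row in rows:
        cells = [
            f"{name} = {value}"
            for name, value in zip(header[1:], row[1:])
            if value != ""
        ]
        lines.append(f"{row[0]}: " + ", ".join(cells))
    return lines


def recalc_p_from_z(z):
    """Two-sided p-value of a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))


def p_is_consistent(reported_p, z, decimals=3):
    """Check whether the reported p-value matches the recalculated one.

    The two values count as consistent if they differ by no more than half
    a unit in the last reported decimal place (a simplifying assumption).
    """
    tolerance = 0.5 * 10 ** (-decimals)
    return abs(recalc_p_from_z(z) - reported_p) <= tolerance


if __name__ == "__main__":
    table = [
        ["Predictor", "z", "p"],
        ["Age", "1.96", "0.05"],
        ["Income", "2.58", "0.03"],
    ]
    for line, row in zip(collapse_table(table), table[1:]):
        z, reported_p = float(row[1]), float(row[2])
        print(line, "| consistent:", p_is_consistent(reported_p, z))
```

A real consistency check would have to handle further test statistics (t, F, chi-square, r), one-sided tests, and p-values reported as inequalities or coded with asterisks, none of which this sketch attempts.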
