Dates often contribute towards highly impactful medical decisions, but it is rarely clear how to extract this data. AI has only just begun to be used to transcribe medical documents, and the common methods are either to trust the output produced by a complex AI model or to parse the text using regular expressions. Recent work has established that regular expressions are an explainable form of logic, but it is difficult to decompose these into the component parts that are required to construct precise UNIX timestamps. First, we tested publicly available regular expressions and found that they were unable to capture a significant number of our dates. Next, we manually created easily decomposable regular expressions and found that they detected the majority of real dates, but also many sequences of text that only look like dates. Finally, we used regular expression synthesis to automatically identify regular expressions from the reverse-engineered UNIX timestamps that we created. We find that the regular expressions created by regular expression synthesis detect far fewer date-like sequences of text than those that were manually created, at the cost of a slight increase in the number of missed dates. Overall, our results show that regular expression synthesis can create regular expressions that identify complex dates and date ranges in text transcriptions. To our knowledge, our proposed way of learning deterministic logic, by reverse-engineering several many-to-one mappings and feeding these into a regular expression synthesiser, is a new approach.
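To illustrate what an "easily decomposable" regular expression might look like, the following is a minimal sketch and not the expressions used in this work. It assumes a day-first numeric format (DD/MM/YYYY or DD-MM-YYYY) and midnight UTC; the names `DATE_PATTERN` and `to_unix_timestamps` are hypothetical. Named groups expose the day, month and year, so each match can be converted to a UNIX timestamp, and text that merely looks like a date can be rejected.

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern: each named group is one component of the date,
# so a match decomposes directly into day, month and year.
# Assumption: day-first ordering, four-digit years from 1900 to 2099.
DATE_PATTERN = re.compile(
    r"\b(?P<day>0?[1-9]|[12][0-9]|3[01])[/-]"
    r"(?P<month>0?[1-9]|1[0-2])[/-]"
    r"(?P<year>(?:19|20)\d{2})\b"
)


def to_unix_timestamps(text):
    """Return (matched_text, unix_timestamp) pairs for valid dates in text."""
    results = []
    for m in DATE_PATTERN.finditer(text):
        try:
            # Assumption: no time of day is given, so use midnight UTC.
            dt = datetime(
                int(m["year"]), int(m["month"]), int(m["day"]),
                tzinfo=timezone.utc,
            )
        except ValueError:
            # The text matched the pattern but is not a real calendar date
            # (for example 31/02/2020), so it is skipped.
            continue
        results.append((m.group(0), int(dt.timestamp())))
    return results


print(to_unix_timestamps("Seen on 03/11/2021, review 31/02/2020."))
```

The calendar check in the `try` block is one simple way to discard sequences that match the pattern without being real dates; it does not address ambiguity between day-first and month-first formats.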