While humans can extract information from unstructured text with high precision and recall, doing so is often too time-consuming to be practical. Automated approaches, on the other hand, produce nearly immediate results but may not be reliable enough for high-stakes applications where precision is essential. In this work, we consider the benefits and drawbacks of various human-only, human-machine, and machine-only information extraction approaches. We argue for the utility of a human-in-the-loop approach in applications where high precision is required but purely manual extraction is infeasible. We present a framework and an accompanying tool for information extraction using weak-supervision labelling with human validation. We demonstrate our approach on three criminal justice datasets. We find that the combination of computer speed and human understanding yields precision comparable to manual annotation while requiring only a fraction of the time, and significantly outperforms fully automated baselines in terms of precision.