LFTK: Handcrafted Features in Computational Linguistics

Past research has identified a rich set of handcrafted linguistic features that can potentially assist various tasks. However, their sheer number makes it difficult to select and use existing handcrafted features effectively. Implementations are also inconsistent across research works, and there is no categorization scheme or generally accepted set of feature names, which creates confusion. In addition, most existing handcrafted feature extraction libraries are either not open-source or not actively maintained, so a researcher often has to build an extraction system from the ground up. We collect and categorize more than 220 popular handcrafted features grounded in past literature. We then conduct a correlation analysis on several task-specific datasets and report the potential use cases of each feature. Lastly, we devise a multilingual handcrafted linguistic feature extraction system designed to be systematically expandable. We open-source the system to give the public access to a rich set of pre-implemented handcrafted features. The system is named LFTK and is the largest of its kind. It is available at github.com/brucewlee/lftk.
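To make the idea of a feature-to-label correlation analysis concrete, here is a minimal sketch in plain Python. It does not use LFTK and does not reflect its API: the function names (`type_token_ratio`, `average_word_length`, `pearson`) are hypothetical, the two features are simple stand-ins for handcrafted features, and the texts and labels are toy values made up purely for illustration, not results from any dataset.

```python
import math
import re


def _tokens(text):
    # Naive tokenizer (assumption): lowercase alphabetic runs only.
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Unique tokens divided by total tokens (0.0 for empty text)."""
    tokens = _tokens(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def average_word_length(text):
    """Mean number of characters per token (0.0 for empty text)."""
    tokens = _tokens(text)
    if not tokens:
        return 0.0
    return sum(len(t) for t in tokens) / len(tokens)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy texts with made-up difficulty labels (illustration only).
    texts = [
        "The cat sat on the mat.",
        "A dog ran to the big red ball.",
        "Researchers evaluate linguistic features across several datasets.",
        "Handcrafted features occasionally complement neural representations.",
    ]
    labels = [1, 1, 3, 4]

    for name, feature in [("type_token_ratio", type_token_ratio),
                          ("average_word_length", average_word_length)]:
        values = [feature(t) for t in texts]
        print(name, round(pearson(values, labels), 3))
```

Ranking features by the absolute value of such a coefficient is one simple way to suggest which features may be useful for a given task; the analysis described above may differ in its details.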

