[Recommended] FoolNLTK: a Chinese text processing toolkit (BiLSTM word segmentation)

December 27, 2017, 机器学习研究会


Summary

Reposted from: 爱可可-爱生活

A Chinese text processing toolkit

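Features

Possibly not the fastest open-source Chinese word segmenter, but quite possibly the most accurate
Trained with a BiLSTM model
Covers word segmentation, part-of-speech tagging, and named entity recognition, all with relatively high accuracy
Supports user-defined dictionaries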

Install

pip install foolnltk

Usage

Word segmentation

import fool

text = "一个傻子在北京"
print(fool.cut(text))
# ['一个', '傻子', '在', '北京']

Command-line segmentation

python -m fool [filename]

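User-defined dictionary

The dictionary format is shown below: one word per line, followed by its weight. The higher a word's weight and the longer the word, the more likely it is to appear in the segmentation result. Weights should be greater than 1.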

难受香菇 10
什么鬼 10
分词工具 10
北京 10
北京天安门 10
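A minimal sketch of producing such a file from Python, assuming one entry per line with the word and its weight separated by a single space, as in the sample above. The helper name `write_userdict` and the file name `user_dict.txt` are hypothetical and not part of FoolNLTK; the check only enforces the "greater than 1" recommendation.

```python
import os
import tempfile


def write_userdict(entries, path):
    """Write (word, weight) pairs, one "word weight" entry per line.

    Hypothetical helper for illustration; not a FoolNLTK function.
    """
    with open(path, "w", encoding="utf-8") as f:
        for word, weight in entries:
            # The text above asks for weights greater than 1.
            if weight <= 1:
                raise ValueError(f"weight for {word!r} should be greater than 1")
            f.write(f"{word} {weight}\n")


entries = [
    ("难受香菇", 10),
    ("什么鬼", 10),
    ("分词工具", 10),
    ("北京", 10),
    ("北京天安门", 10),
]

# Write to a temporary directory so the sketch leaves no files behind in the CWD.
path = os.path.join(tempfile.mkdtemp(), "user_dict.txt")
write_userdict(entries, path)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

The resulting `path` is the kind of value that would then be passed to `fool.load_userdict`.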

Loading the dictionary

import fool
fool.load_userdict(path)
text = "我在北京天安门看你难受香菇"
print(fool.cut(text))
# ['我', '在', '北京天安门', '看', '你', '难受香菇']
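To make the "higher weight, longer word" rule concrete, here is a toy sketch in plain Python. It is not FoolNLTK's implementation and does not use the BiLSTM model at all: it assumes, purely for illustration, that a dictionary word scores `weight * length`, that any single character scores 1, and that the split with the highest total score wins. The function name `toy_segment` is hypothetical.

```python
def toy_segment(text, weights):
    """Return the split of `text` with the highest total score.

    Assumed scoring (illustrative only): a dictionary word scores
    weight * length, any other single character scores 1.
    """
    n = len(text)
    max_len = max((len(w) for w in weights), default=1)
    # best[i] = (best total score for text[i:], end index of the first piece)
    best = {n: (0, n)}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, min(n, i + max_len) + 1):
            word = text[i:j]
            if word in weights:
                score = weights[word] * len(word)
            elif j == i + 1:
                score = 1  # fall back to a single character
            else:
                continue
            candidates.append((score + best[j][0], j))
        best[i] = max(candidates)
    # Walk the recorded end indices to rebuild the pieces.
    pieces, i = [], 0
    while i < n:
        j = best[i][1]
        pieces.append(text[i:j])
        i = j
    return pieces


weights = {"难受香菇": 10, "什么鬼": 10, "分词工具": 10, "北京": 10, "北京天安门": 10}
print(toy_segment("我在北京天安门看你难受香菇", weights))
# ['我', '在', '北京天安门', '看', '你', '难受香菇']

# With a low weight on the longer entry, the shorter word wins instead.
print(toy_segment("北京天安门", {"北京": 10, "北京天安门": 2}))
# ['北京', '天', '安', '门']
```

Under these assumed scores, "北京天安门" (10 × 5) beats "北京" plus three single characters (10 × 2 + 3), which is why the longer entry survives when both carry the same weight.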

Removing the dictionary

fool.delete_userdict()

Part-of-speech tagging

import fool

text = "一个傻子在北京"
print(fool.pos_cut(text))
#[('一个', 'm'), ('傻子', 'n'), ('在', 'p'), ('北京', 'ns')]
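The result is a list of `(word, tag)` pairs, so ordinary list handling is enough to work with it. The sketch below runs on the sample output copied from above rather than calling FoolNLTK, and it assumes only that tags are plain strings; selecting tags that start with `n` is an illustrative choice, not a documented convention of the library.

```python
from collections import defaultdict

# Sample output copied from the pos_cut example above.
tagged = [('一个', 'm'), ('傻子', 'n'), ('在', 'p'), ('北京', 'ns')]

# Group words by their tag.
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

# Keep words whose tag starts with "n" ('n' and 'ns' in this sample).
noun_like = [word for word, tag in tagged if tag.startswith('n')]

print(dict(words_by_tag))
print(noun_like)  # ['傻子', '北京']
```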

Named entity recognition

import fool 

text = "一个傻子在北京"
words, ners = fool.analysis(text)
print(ners)
#[(5, 8, 'location', '北京')]
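Each entity comes back as a 4-tuple. A small sketch of giving those fields names, using the sample output copied from above instead of calling FoolNLTK. The field names `start`, `end`, `label` and `text` are assumptions read off the sample; the source does not document what the two numbers mean, so the code only passes them through.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    # Field names are assumed from the sample output, not from FoolNLTK docs.
    start: int
    end: int
    label: str
    text: str


# Sample output copied from the analysis example above.
ners = [(5, 8, 'location', '北京')]

entities = [Entity(*item) for item in ners]
locations = [e.text for e in entities if e.label == 'location']

print(entities[0])
print(locations)  # ['北京']
```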

Notes

So far tested only with Python 3 on Linux.

Link:

https://github.com/rockyzhengwu/FoolNLTK

Original post:

https://m.weibo.cn/1402400261/4188834948484282
