Python AI Natural Language Processing
SnowNLP is a Python library that makes it easy to work with Chinese text. It supports, among other things:
Chinese word segmentation
Part-of-speech tagging
Sentiment analysis
Text classification
Keyword extraction
Text similarity calculation
Installation: pip install snownlp
After installing snownlp, the module's directory structure looks like this:
[Image: snownlp module directory structure]
normal: text normalization (pinyin conversion, traditional-to-simplified conversion)
seg: Chinese word segmentation
sentiment: sentiment analysis
sim: text similarity (BM25)
summary: text summary extraction
tag: part-of-speech tagging
__init__.py: exposes the module's methods via the SnowNLP class
To understand what snownlp provides, open __init__.py and look at the methods of the SnowNLP class:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

from . import normal
from . import seg
from . import tag
from . import sentiment
from .sim import bm25
from .summary import textrank
from .summary import words_merge


class SnowNLP(object):

    def __init__(self, doc):
        self.doc = doc
        self.bm25 = bm25.BM25(doc)

    @property
    def words(self):
        return seg.seg(self.doc)

    @property
    def sentences(self):
        return normal.get_sentences(self.doc)

    @property
    def han(self):
        return normal.zh2hans(self.doc)

    @property
    def pinyin(self):
        return normal.get_pinyin(self.doc)

    @property
    def sentiments(self):
        return sentiment.classify(self.doc)

    @property
    def tags(self):
        words = self.words
        tags = tag.tag(words)
        return zip(words, tags)

    @property
    def tf(self):
        return self.bm25.f

    @property
    def idf(self):
        return self.bm25.idf

    def sim(self, doc):
        return self.bm25.simall(doc)

    def summary(self, limit=5):
        doc = []
        sents = self.sentences
        for sent in sents:
            words = seg.seg(sent)
            words = normal.filter_stop(words)
            doc.append(words)
        rank = textrank.TextRank(doc)
        rank.solve()
        ret = []
        for index in rank.top_index(limit):
            ret.append(sents[index])
        return ret

    def keywords(self, limit=5, merge=False):
        doc = []
        sents = self.sentences
        for sent in sents:
            words = seg.seg(sent)
            words = normal.filter_stop(words)
            doc.append(words)
        rank = textrank.KeywordTextRank(doc)
        rank.solve()
        ret = []
        for w in rank.top_index(limit):
            ret.append(w)
        if merge:
            wm = words_merge.SimpleMerge(self.doc, ret)
            return wm.merge()
        return ret
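Both summary() and keywords() above tokenize each sentence, filter stop words, and hand the result to a TextRank solver. As a rough illustration of the underlying idea (a minimal sketch, not snownlp's actual textrank module; the function name and toy data below are made up), a keyword-style TextRank iterates PageRank over a word co-occurrence graph:

```python
from collections import defaultdict


def textrank_keywords(sentences, window=2, d=0.85, iters=50, limit=5):
    """Minimal TextRank sketch: rank words by iterating PageRank
    over a co-occurrence graph built from tokenized sentences."""
    graph = defaultdict(set)
    for words in sentences:
        for i, w in enumerate(words):
            # link each word to neighbors within the window
            for neighbor in words[max(0, i - window):i + window + 1]:
                if neighbor != w:
                    graph[w].add(neighbor)
                    graph[neighbor].add(w)
    score = {w: 1.0 for w in graph}
    for _ in range(iters):
        new = {}
        for w in graph:
            new[w] = (1 - d) + d * sum(
                score[v] / len(graph[v]) for v in graph[w])
        score = new
    return [w for w, _ in sorted(score.items(),
                                 key=lambda kv: -kv[1])[:limit]]


# Toy usage with pre-segmented sentences
docs = [['natural', 'language', 'processing'],
        ['language', 'models', 'processing', 'text'],
        ['text', 'processing', 'tools']]
print(textrank_keywords(docs, limit=3))
```

Words that co-occur with many distinct words accumulate the highest scores, which is why frequent, well-connected terms surface as keywords.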
The snownlp module exposes all of these methods. The usage examples below follow the official documentation.
from snownlp import SnowNLP

s = SnowNLP(u'这个东西真心很赞')

# Word segmentation
s.words         # [u'这个', u'东西', u'真心',
                #  u'很', u'赞']

# Part-of-speech tagging
s.tags          # [(u'这个', u'r'), (u'东西', u'n'),
                #  (u'真心', u'd'), (u'很', u'd'),
                #  (u'赞', u'Vg')]

# Sentiment analysis
s.sentiments    # 0.9769663402895832 (probability of positive sentiment)

# Pinyin conversion
s.pinyin        # [u'zhe', u'ge', u'dong', u'xi',
                #  u'zhen', u'xin', u'hen', u'zan']

s = SnowNLP(u'「繁體字」「繁體中文」的叫法在臺灣亦很常見。')

# Traditional-to-simplified conversion
s.han           # u'「繁体字」「繁体中文」的叫法
                #  在台湾亦很常见。'

text = u'''
自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。
它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。
自然语言处理是一门融语言学、计算机科学、数学于一体的科学。
因此，这一领域的研究将涉及自然语言，即人们日常使用的语言，
所以它与语言学的研究有着密切的联系，但又有重要的区别。
自然语言处理并不是一般地研究自然语言，
而在于研制能有效地实现自然语言通信的计算机系统，
特别是其中的软件系统。因而它是计算机科学的一部分。
'''

s = SnowNLP(text)

# Keyword extraction
s.keywords(3)   # [u'语言', u'自然', u'计算机']

# Summary extraction
s.summary(3)    # [u'因而它是计算机科学的一部分',
                #  u'自然语言处理是一门融语言学、计算机科学、
                #   数学于一体的科学',
                #  u'自然语言处理是计算机科学领域与人工智能
                #   领域中的一个重要方向']

# Sentence splitting
temp_list = s.sentences

s = SnowNLP([[u'这篇', u'文章'],
             [u'那篇', u'论文'],
             [u'这个']])

# TF-IDF statistics
s.tf
s.idf

# Text similarity: score each document in s against the query [u'文章']
s.sim([u'文章'])  # [0.3756070762985226, 0, 0]
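The tf, idf, and sim calls above are backed by the BM25 index built in __init__. To make the scoring arithmetic concrete, here is a small self-contained sketch of the standard BM25 formula; it is an illustration, not snownlp's bm25 module, and SimpleBM25 together with its toy documents is made up for this example:

```python
import math


class SimpleBM25:
    """Minimal BM25 sketch over pre-tokenized documents."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in docs) / len(docs)
        # term frequency table per document
        self.f = [{w: d.count(w) for w in set(d)} for d in docs]
        # document frequency, then inverse document frequency per term
        df = {}
        for d in docs:
            for w in set(d):
                df[w] = df.get(w, 0) + 1
        n = len(docs)
        self.idf = {w: math.log(n - k + 0.5) - math.log(k + 0.5)
                    for w, k in df.items()}

    def sim(self, query, index):
        """BM25 score of `query` against document `index`."""
        score = 0.0
        d = len(self.docs[index])
        for w in query:
            if w not in self.f[index]:
                continue
            freq = self.f[index][w]
            score += self.idf[w] * freq * (self.k1 + 1) / (
                freq + self.k1 * (1 - self.b + self.b * d / self.avgdl))
        return score

    def simall(self, query):
        return [self.sim(query, i) for i in range(len(self.docs))]


docs = [['this', 'article'], ['that', 'paper'], ['this']]
bm25 = SimpleBM25(docs)
scores = bm25.simall(['article'])
# 'article' appears only in the first document, so only it scores non-zero
```

Documents that do not contain any query term score exactly zero, which matches the zeros snownlp returns in its sim output above.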
About training
Training on your own data is a good way to improve on the bundled corpora. Training is currently supported for word segmentation, part-of-speech tagging, and sentiment analysis. Take word segmentation as an example: its training code and data live under the snownlp/seg directory.
# Word segmentation training
from snownlp import seg
seg.train('data.txt')
seg.save('seg.marshal')

# Part-of-speech tagging training
# from snownlp import tag
# tag.train('199801.txt')
# tag.save('tag.marshal')

# Sentiment analysis training
# from snownlp import sentiment
# sentiment.train('neg.txt', 'pos.txt')
# sentiment.save('sentiment.marshal')
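The sentiment model trained from neg.txt and pos.txt is a Bayes classifier: it counts word frequencies per class and scores new text by comparing likelihoods. The following is a minimal naive Bayes sketch of that idea on toy, made-up English data; it is an illustration of the technique, not snownlp's implementation:

```python
import math
from collections import Counter


def train_nb(pos_docs, neg_docs):
    """Count word frequencies per class; return counts and vocabulary."""
    pos, neg = Counter(), Counter()
    for d in pos_docs:
        pos.update(d)
    for d in neg_docs:
        neg.update(d)
    vocab = set(pos) | set(neg)
    return pos, neg, vocab


def positive_prob(words, pos, neg, vocab):
    """P(positive | words) via naive Bayes with add-one smoothing
    and uniform class priors."""
    v = len(vocab)
    lp = sum(math.log((pos[w] + 1) / (sum(pos.values()) + v)) for w in words)
    ln = sum(math.log((neg[w] + 1) / (sum(neg.values()) + v)) for w in words)
    # convert the two log-likelihoods back to a probability
    return 1 / (1 + math.exp(ln - lp))


pos_docs = [['great', 'product'], ['really', 'great']]
neg_docs = [['terrible'], ['broken', 'product']]
pos, neg, vocab = train_nb(pos_docs, neg_docs)
p = positive_prob(['great'], pos, neg, vocab)  # > 0.5: leans positive
```

A score above 0.5 leans positive, which mirrors how s.sentiments is read as a probability of positive sentiment.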
The trained model is saved as seg.marshal; then edit data_path in snownlp/seg/__init__.py so that it points to the newly trained file.
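The .marshal files produced by train/save are serialized with Python's built-in marshal module. The snippet below only shows how such a file round-trips; the toy model dict is made up, and the real layout of snownlp's trained tables may differ:

```python
import marshal
import os
import tempfile

# Toy stand-in for the frequency tables a trained segmenter might keep.
model = {'total': 3, 'freq': {'word': 2, 'text': 1}}

path = os.path.join(tempfile.mkdtemp(), 'seg.marshal')
with open(path, 'wb') as f:
    marshal.dump(model, f)      # conceptually what save('seg.marshal') does

with open(path, 'rb') as f:
    restored = marshal.load(f)  # conceptually what loading via data_path does
```

Note that marshal is a Python-version-specific format intended for Python's own use, which is why a model trained under one Python version may need retraining under another.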