Natural language processing flow in artificial intelligence
The first step: get the corpus
A corpus is, literally, language material: it is what linguistic research studies, and individual language samples are the basic units that make up a corpus. Since the real-world context of language is hard to observe directly, people simply use text as a stand-in, treating the contextual relationships within text as a substitute for the contextual relationships of language in the real world. A single collection of texts is called a corpus; several such collections are called corpora. (Definition adapted from Baidu Encyclopedia.) By source, corpora fall into the following two categories:
1. Existing corpora
Many business departments, companies, and other organizations accumulate large amounts of paper or electronic text as their business grows. Where conditions permit, these materials can be lightly consolidated, and the paper documents digitized, to serve as our corpus.
2. Corpora downloaded or crawled online
What if you have no data of your own? You can obtain standard open datasets from home and abroad; for Chinese, for example, the People's Daily corpus. Since most foreign datasets are in English or other languages, they are not used here. You can also crawl some data yourself with a web crawler and then proceed with the follow-up steps.
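As a rough illustration of the crawling route, here is a minimal sketch using requests and BeautifulSoup. The URL is a placeholder, and a real crawler would also need rate limiting, robots.txt compliance, and error recovery.

```python
# A minimal crawling sketch: download a page and keep only its visible text.
# The URL below is a placeholder, not a real data source.
import requests
from bs4 import BeautifulSoup

def fetch_page_text(url: str) -> str:
    """Download a page and return its visible text."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Drop script/style blocks before extracting text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

if __name__ == "__main__":
    text = fetch_page_text("https://example.com/news/article-1")  # placeholder URL
    print(text[:200])
```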
The second step: corpus preprocessing
Here we focus on corpus preprocessing. In a complete Chinese natural language processing engineering application, corpus preprocessing typically accounts for 50%-70% of the total workload, so developers spend most of their time on it. Below, preprocessing is covered in four major parts: corpus cleaning, word segmentation, part-of-speech tagging, and stop-word removal.
1. Corpus cleaning
Data cleaning, as the name implies, keeps what we are interested in in the corpus and removes everything that is uninteresting or counts as noise. This includes extracting the title, abstract, and body from raw documents, and stripping advertisements, tags, HTML, JavaScript, and other code and comments from crawled web pages. Common cleaning methods include manual de-duplication, alignment, deletion, and labeling, as well as rule extraction, regular-expression matching, dictionary- and named-entity-based extraction, and batch processing with scripts or code.
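As a sketch of the rule-based route, the snippet below strips HTML and JavaScript remnants with regular expressions. The patterns are illustrative only, not a complete cleaning rule set.

```python
# A minimal rule-based cleaning sketch: remove inline JS/CSS, strip remaining
# tags, and collapse whitespace. Real pipelines add rules for ads, headers, etc.
import re

def clean_text(raw: str) -> str:
    text = re.sub(r"<script.*?</script>", " ", raw, flags=re.S | re.I)  # inline JavaScript
    text = re.sub(r"<style.*?</style>", " ", text, flags=re.S | re.I)   # inline CSS
    text = re.sub(r"<[^>]+>", " ", text)                                # remaining tags
    text = re.sub(r"\s+", " ", text)                                    # collapse whitespace
    return text.strip()

print(clean_text("<p>Hello <b>world</b></p><script>var x=1;</script>"))
# -> "Hello world"
```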
2. Word segmentation
Chinese corpus data is a collection of short or long texts: sentences, abstracts, paragraphs, or entire articles. In general, the characters within sentences and paragraphs run together continuously and carry meaning as a whole. For text mining and analysis, we want the smallest unit of processing to be the word, so at this stage we need word segmentation to segment all of the text.
Common word segmentation algorithms include string-matching-based, understanding-based, statistics-based, and rule-based segmentation; each category covers many specific algorithms.
The main difficulties for current Chinese word segmentation algorithms are ambiguity resolution and new-word recognition. For example, "羽毛球拍卖完了" can be segmented as "羽毛球拍 / 卖完了" ("the badminton rackets are sold out") or as "羽毛球 / 拍卖 / 完了" ("the badminton auction is over"); without other sentences from the context, it is hard to know which reading is intended.
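A quick way to see segmentation in practice is the popular jieba library; the sketch below, which assumes jieba is installed, segments the ambiguous sentence above in both its default and full modes.

```python
# A minimal segmentation sketch with jieba (pip install jieba).
# Full mode enumerates all candidate words, exposing the ambiguity.
import jieba

sentence = "羽毛球拍卖完了"
print("/".join(jieba.cut(sentence)))                 # default (accurate) mode
print("/".join(jieba.cut(sentence, cut_all=True)))   # full mode: all candidates
```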
3. Part-of-speech tagging
Part-of-speech tagging assigns each word a tag such as adjective, verb, or noun, allowing the text to carry more useful linguistic information into subsequent processing. It is a classic sequence-labeling problem. For Chinese natural language processing, however, part-of-speech tagging is not always necessary: common text classification, for example, does not care about part of speech, while tasks like sentiment analysis and knowledge reasoning do need it. (Figure omitted: common Chinese part-of-speech tags.)
Common part-of-speech tagging methods divide into rule-based and statistics-based approaches. Statistical methods include maximum-entropy part-of-speech tagging, tagging by the statistically most probable part of speech, and HMM-based part-of-speech tagging.
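As an illustration, jieba also ships a statistical part-of-speech tagger; the minimal sketch below assumes jieba is installed.

```python
# A minimal POS-tagging sketch using jieba's tagger (jieba.posseg).
import jieba.posseg as pseg

for word, flag in pseg.cut("我爱自然语言处理"):
    print(f"{word}\t{flag}")  # word and its part-of-speech tag, e.g. v, n
```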
4. Stop-word removal
Stop words generally refer to words that contribute nothing to the text's features, such as punctuation marks, modal particles, and pronouns. So in ordinary text processing, the step after word segmentation is stop-word removal. For Chinese, however, stop-word removal is not fixed: the stop-word dictionary is chosen for the specific scenario. In sentiment analysis, for example, modal particles and exclamation marks should be retained, because they contribute to expressing tone and emotional coloring.
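A minimal sketch of stop-word removal after segmentation follows; the inline stop-word list is a toy stand-in for a scenario-specific dictionary file.

```python
# A minimal stop-word removal sketch: segment, then filter against a stop list.
import jieba

stopwords = {"的", "了", "是", "，", "。"}  # toy list; load a real dictionary in practice

tokens = jieba.cut("今天的天气真的很好。")
filtered = [t for t in tokens if t not in stopwords]
print(filtered)
```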
The third step: feature engineering
After corpus preprocessing, you need to consider how to represent the segmented words in a form a computer can compute with. Clearly, to compute on strings we must at least convert the Chinese word segments into numbers, that is, into vectors in the mathematical sense. There are two commonly used representation models: the bag-of-words model and word vectors.
Bag of Words (BoW) ignores the order of words in a sentence: each word or symbol is simply placed in a collection (such as a list), and its occurrences are counted. Raw word frequency is only the most basic scheme; TF-IDF is the classic refinement of the bag-of-words model.
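As a sketch, scikit-learn's CountVectorizer and TfidfVectorizer implement these two schemes; for Chinese, the documents would first be segmented and joined with spaces. The toy English documents below are for illustration only.

```python
# A minimal bag-of-words and TF-IDF sketch with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat and the dog"]

bow = CountVectorizer()
print(bow.fit_transform(docs).toarray())    # raw term counts per document
print(bow.get_feature_names_out())          # the learned vocabulary

tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray())  # TF-IDF weights per document
```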
A word vector is a computational model that converts words into vectors or matrices. The most common basic representation is one-hot, which represents each word as a very long vector whose dimension is the vocabulary size: almost all elements are 0, and a single dimension is 1, identifying the current word. There is also the Google team's Word2Vec, which comprises two models, Skip-Gram and Continuous Bag of Words (CBOW), and two efficient training methods, negative sampling and hierarchical softmax. Notably, Word2Vec word vectors express similarity and analogy between words well. Beyond these, there are other word-vector representations such as Doc2Vec, WordRank, and FastText.
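As an illustration of training word vectors, here is a minimal gensim Word2Vec sketch on a toy corpus; the sg, negative, and hs parameters select between the Skip-Gram/CBOW models and the two training methods mentioned above.

```python
# A minimal Word2Vec sketch with gensim. sg=1 selects Skip-Gram (sg=0 is CBOW);
# negative=5 enables negative sampling (hs=1 would use hierarchical softmax).
from gensim.models import Word2Vec

sentences = [["natural", "language", "processing"],
             ["language", "models", "learn", "word", "vectors"],
             ["word", "vectors", "capture", "similarity"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, negative=5)
print(model.wv["word"][:5])                   # first dimensions of one vector
print(model.wv.similarity("word", "vectors")) # cosine similarity of two words
```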
The fourth step: feature selection
As in data mining, feature engineering is essential in text mining problems. In a practical problem, constructing the feature vector means choosing appropriate, expressive features. Text features are generally words that carry semantic information. Feature selection finds a feature subset that still retains semantic information, whereas the feature subspace found by feature extraction loses some of it. Feature selection is therefore a challenging process that depends heavily on experience and domain knowledge, though many off-the-shelf algorithms exist. Six feature selection methods are common at present: DF (document frequency), MI (mutual information), IG (information gain), CHI (chi-square), WLLR (weighted log-likelihood ratio), and WFO (weighted frequency and odds).
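As an example of one of the six methods, the sketch below runs CHI (chi-square) feature selection with scikit-learn; the documents and labels are toy values for illustration.

```python
# A minimal CHI (chi-square) feature selection sketch with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["great movie, loved it", "terrible movie, hated it",
        "loved the acting", "hated the plot"]
labels = [1, 0, 1, 0]  # toy sentiment labels

X = CountVectorizer().fit_transform(docs)
selector = SelectKBest(chi2, k=4)        # keep the 4 highest-scoring terms
X_selected = selector.fit_transform(X, labels)
print(X_selected.shape)                  # (4 documents, 4 selected features)
```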
The fifth step: model training
Once the feature vectors are chosen, the next task is, of course, training the model. Different applications call for different models: traditional supervised and unsupervised machine learning models such as KNN, SVM, Naive Bayes, decision trees, GBDT, and K-means; and deep learning models such as CNN, RNN, LSTM, Seq2Seq, FastText, and TextCNN. These models appear in later examples on classification, clustering, sequence modeling, sentiment analysis, and so on, and are not described here. A few points worth noting when training a model follow below.
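As a minimal end-to-end illustration of the traditional route, the sketch below trains a Naive Bayes classifier on TF-IDF features; the training data is a toy stand-in for a real labeled corpus.

```python
# A minimal supervised-training sketch: TF-IDF features plus Naive Bayes,
# one of the traditional models mentioned above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = ["good service", "bad service", "good food", "bad food"]
train_labels = ["pos", "neg", "pos", "neg"]  # toy labels

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)
print(clf.predict(["good experience"]))  # expected: ['pos']
```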