Introduction to Neural Network Models for Natural Language Processing
Main content
Input encoding for natural language tasks
Feed-forward networks
Convolutional networks
Recurrent networks
Recursive networks
The computation graph abstraction for automatic gradient computation
Terminology in neural networks
Feature: a concrete linguistic input
Input vector: the actual input that is fed to the neural-network classifier
Input vector entry: a specific value of the input
Mathematical notation
Bold uppercase letters (X, Y, Z) denote matrices
Bold lowercase letters (b) denote vectors
W^1 denotes the matrix of the first layer in the network; the superscript is a layer index
(W^1)^2 denotes the power of a matrix: the exponent outside the parentheses squares the matrix, distinguishing a power from a layer index
[v1; v2] denotes vector concatenation
Neural network architecture
Feed-forward networks
Include networks with fully connected layers, such as the multi-layer perceptron, as well as networks with convolutional and pooling layers. All of these networks act as classifiers, but each has different strengths.
Fully connected feed-forward neural networks
Can be used as a drop-in replacement wherever a linear learner is used
The non-linearity of the network, as well as the ability to easily integrate pre-trained word embeddings, often leads to superior classification accuracy
Used as such a replacement, they provide benefits for CCG supertagging, dialog state tracking, pre-ordering for statistical machine translation, and language modeling
Networks with convolutional and pooling layers
Are useful for classification tasks in which we expect to find strong local clues regarding class membership, but these clues can appear in different places in the input
Allow the model to learn to find such local indicators, regardless of their position.
Show promising results on many tasks, including document classification, short-text categorization, sentiment classification, relation-type classification between entities, event detection, paraphrase identification, semantic role labeling, question answering, predicting the box-office revenue of movies based on critic reviews, modeling text interestingness, and modeling the relation between character sequences and part-of-speech tags.
In order to encode arbitrarily large items as fixed-size vectors that capture their most salient features, the convolution-and-pooling architecture sacrifices most of the structural information.
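As a rough illustration of this position-invariance, here is a minimal sketch (plain NumPy; the filter width, dimensions, and embedding table are made-up toy values, not specifics from this text) of a 1D convolution over word vectors followed by max-pooling:

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, n_filters, win = 4, 3, 2                  # toy sizes, chosen arbitrarily
emb = rng.normal(size=(100, d_emb))              # hypothetical embedding table
W = rng.normal(size=(n_filters, win * d_emb))    # one convolution filter per row
b = np.zeros(n_filters)

def conv_and_pool(word_ids):
    """Slide a window over the sentence, apply the filters, max-pool over positions."""
    x = emb[word_ids]                                         # (n_words, d_emb)
    windows = [x[i:i + win].ravel() for i in range(len(word_ids) - win + 1)]
    h = np.tanh(np.stack(windows) @ W.T + b)                  # (n_windows, n_filters)
    return h.max(axis=0)                                      # fixed-size, position-invariant

# A local clue (e.g., word 42 followed by word 8) is detected wherever it occurs.
print(conv_and_pool([5, 17, 42, 8]))
print(conv_and_pool([42, 8, 5, 17]))
```

The max over window positions is what discards the structural information mentioned above: the pooled vector records that a clue occurred, not where.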
In contrast, recurrent and recursive architectures allow us to work with sequences and trees while preserving much of the structural information.
Recurrent networks
Designed to model sequences
Can produce very strong results for language modeling, sequence tagging, machine translation, dependency parsing, sentiment analysis, noisy-text normalization, dialog state tracking, response generation, and modeling the relation between character sequences and part-of-speech tags
Recursive networks
Generalizations of recurrent networks that can handle trees
Can produce state-of-the-art or near state-of-the-art results for constituency and dependency parse re-ranking, discourse parsing, semantic relation classification, political ideology detection based on parse trees, sentiment classification, target-dependent sentiment classification, and question answering.
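For concreteness, here is a minimal sketch of the recurrent step underlying these sequence models (an Elman-style RNN in NumPy; all sizes and the tanh non-linearity are illustrative assumptions, not specifics from this text):

```python
import numpy as np

rng = np.random.default_rng(1)
d_emb, d_hid = 4, 5                      # toy dimensions
emb = rng.normal(size=(100, d_emb))      # hypothetical embedding table
W_x = rng.normal(size=(d_hid, d_emb))
W_h = rng.normal(size=(d_hid, d_hid))
b = np.zeros(d_hid)

def rnn_encode(word_ids):
    """Consume the sequence one token at a time, carrying a state vector."""
    h = np.zeros(d_hid)
    for wid in word_ids:
        # The new state depends on the current word AND the previous state,
        # so the final vector is sensitive to word order.
        h = np.tanh(W_x @ emb[wid] + W_h @ h + b)
    return h

print(rnn_encode([5, 17, 42, 8]))
```

Unlike the pooled CNN vector, this encoding changes when the words are reordered, which is exactly the structural information the recurrent architecture preserves.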
Feature representation
A feed-forward neural network is a function NN(x) whose input is a d_in-dimensional vector x and whose output is a d_out-dimensional vector
Used as a classifier, it assigns the input x a degree of membership in one or more of the d_out classes
Each feature is no longer represented as a unique dimension but as a dense vector. That is, each core feature is embedded into a d-dimensional space and represented as a vector in that space. The embeddings can then be trained just like the other parameters of the function NN.
The fundamental architecture of an NLP classification system based on feed-forward neural networks is as follows (a code sketch follows the list):
Extract a set of core linguistic features f1,...,fk that are relevant for predicting the output
For each feature fi of interest, retrieve the corresponding vector v(fi)
Combine these vectors into an input vector x
Feed x into a nonlinear classifier
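A minimal end-to-end sketch of these four steps, assuming a toy vocabulary, arbitrary layer sizes, and a single hidden layer (every name and number below is an illustrative choice, not something specified in the text):

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = {"the": 0, "dog": 1, "barks": 2}         # hypothetical feature vocabulary
d_emb, d_hid, d_out = 4, 8, 2                    # toy dimensions
E = rng.normal(size=(len(vocab), d_emb))         # embedding table, trained with the network
W1 = rng.normal(size=(d_hid, 3 * d_emb)); b1 = np.zeros(d_hid)
W2 = rng.normal(size=(d_out, d_hid));     b2 = np.zeros(d_out)

def classify(features):
    vecs = [E[vocab[f]] for f in features]       # step 2: look up v(f_i) for each feature
    x = np.concatenate(vecs)                     # step 3: combine the vectors into x
    h = np.tanh(W1 @ x + b1)                     # step 4: feed x to a non-linear classifier
    scores = W2 @ h + b2
    return np.exp(scores) / np.exp(scores).sum() # class membership degrees (softmax)

# Step 1, extracting the features f1,...,fk, is task-specific; here it is given.
print(classify(["the", "dog", "barks"]))
```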
The key to this construction is to use dense feature vectors instead of sparse one-hot vectors, and to use core features rather than hand-crafted feature combinations.
The core features can be represented in two main ways, one-hot vectors and dense vectors. The main differences are:
1) One-hot: each feature has its own dimension, and the value of each dimension is binary: 1 if the feature is present, 0 if it is not.
2) Dense vectors: each core feature is represented as a d-dimensional vector, and each input vector x is composed of several such entries. The mapping from features to vectors comes from an embedding table, and training will drive similar features toward similar vectors (a small sketch of the two encodings follows).
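To make the contrast concrete, a small sketch with a toy vocabulary and an arbitrary d:

```python
import numpy as np

vocab = {"dog": 0, "cat": 1, "barks": 2}   # toy vocabulary
d = 4                                      # embedding dimensionality, chosen arbitrarily

def one_hot(word):
    """One dimension per feature; binary values; no notion of similarity."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

# Dense: each feature maps to a row of an embedding table. After training,
# related features such as "dog" and "cat" should end up with similar rows.
E = np.random.default_rng(3).normal(size=(len(vocab), d))

def dense(word):
    return E[vocab[word]]

print(one_hot("dog"))   # [1. 0. 0.]
print(dense("dog"))     # d real-valued entries
```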
Application scenario
1) If we have relatively few distinct features in a category and we believe there are no correlations between them, use one-hot encodings.
2) If we believe the features of a group are correlated, use dense vectors: they let the network pick up on the correlations between features and gain statistical strength through parameter sharing.
Variable number of features
In many cases the number of features is not known in advance, so we need to represent an unbounded set of features with a fixed-length vector. One way is the continuous bag of words (CBOW), which is very similar to the traditional bag-of-words representation in that it discards word-order information:
$$\text{CBOW}(f_1,\dots,f_k) = \frac{1}{k}\sum_{i=1}^{k} v(f_i)$$
A variant of the CBOW representation is the weighted CBOW (WCBOW), in which different vectors receive different weights:
$$\text{WCBOW}(f_1,\dots,f_k) = \frac{1}{\sum_{i=1}^{k} a_i}\sum_{i=1}^{k} a_i\, v(f_i)$$

Here $a_i$ indicates the relative importance of feature $f_i$; for example, $a_i$ could be the TF-IDF score of $f_i$.
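A minimal sketch of both formulas over a toy embedding table (the weights passed to the weighted version are made-up numbers standing in for, say, TF-IDF scores):

```python
import numpy as np

rng = np.random.default_rng(4)
E = rng.normal(size=(100, 4))            # hypothetical embedding table, d = 4

def cbow(feature_ids):
    """Unweighted average: (1/k) * sum_i v(f_i)."""
    return E[feature_ids].mean(axis=0)

def wcbow(feature_ids, weights):
    """Weighted average: sum_i a_i * v(f_i) / sum_i a_i."""
    a = np.asarray(weights, dtype=float)
    return (a[:, None] * E[feature_ids]).sum(axis=0) / a.sum()

ids = [5, 17, 42]                        # any number of features maps to a fixed-size vector
print(cbow(ids))
print(wcbow(ids, [0.3, 1.7, 0.9]))       # e.g., each a_i could be a TF-IDF score
```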
Distance and position features
The linear distance between two words in a sentence can serve as an informative feature
Distance-feature encoding: bin the distance values into groups, associate each bin with a d-dimensional vector (the distance embedding), and then train these vectors as regular parameters of the network (a sketch follows)
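A small sketch of such binning; the bin boundaries below are an assumed example, not values prescribed by the text:

```python
import numpy as np

# Assumed bins: distances 1, 2, 3, 4, then 5-10, then 10+.
BIN_EDGES = [1, 2, 3, 4, 10]
d = 4
dist_emb = np.random.default_rng(5).normal(size=(len(BIN_EDGES) + 1, d))

def distance_bin(dist):
    """Map a raw word-pair distance to its bin index."""
    for i, edge in enumerate(BIN_EDGES):
        if dist <= edge:
            return i
    return len(BIN_EDGES)                 # the 10+ bin

def distance_vector(dist):
    """The bin's embedding is trained jointly with the rest of the network."""
    return dist_emb[distance_bin(dist)]

print(distance_vector(7))                 # distance 7 falls in the 5-10 bin
```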
Feature combination
In neural networks, feature extraction deals only with core features
In traditional linear-model-based NLP, the feature designer has to manually specify not only the core features of interest but also the interactions between them
Feature combinations are important in linear models because they introduce more dimensions into the input, which can make otherwise inseparable data points linearly separable
The non-linear classifier defined by the neural network is expected to find such indicative feature combinations on its own, reducing the feature-engineering effort
Kernel methods likewise allow the designer to specify only core features, leaving feature combination to the learning algorithm, and they admit convex, exactly solvable optimization problems. However, their classification complexity scales linearly with the size of the training data, making them too slow for large datasets. The classification complexity of a neural network is independent of the training-data size and scales linearly only with the size of the network.
Dimensionality
The dimensionality should grow with the number of distinct members of the feature class.
Since the dimensionality directly affects memory requirements and processing time, a good practice is to experiment with a few different sizes and choose the one with the best trade-off between speed and task accuracy.
Vector Sharing
In some cases different features share the same underlying vocabulary; for example, a "previous word" feature and a "next word" feature both draw on the vocabulary of words. One must then decide whether the two features share a single embedding table or each get their own (a brief sketch of both options follows).
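A brief sketch of the two options, using hypothetical previous-word/next-word features:

```python
import numpy as np

rng = np.random.default_rng(6)
vocab_size, d = 100, 4

# Option 1: shared table -- word 7 gets the same vector whether it appears
# as the previous word or the next word; its role is conveyed only by where
# the vector lands in the concatenated input.
E_shared = rng.normal(size=(vocab_size, d))
x_shared = np.concatenate([E_shared[7], E_shared[7]])

# Option 2: separate tables -- the same word can learn different
# representations for the two positional roles.
E_prev = rng.normal(size=(vocab_size, d))
E_next = rng.normal(size=(vocab_size, d))
x_separate = np.concatenate([E_prev[7], E_next[7]])

print(x_shared.shape, x_separate.shape)
```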