Neural probabilistic language model for deep learning
The goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model is tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short, overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words, which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model simultaneously learns a distributed representation for each word and the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words in an already-seen sequence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. In this paper we report on experiments using neural networks for the probability function, which allows taking advantage of longer contexts, and show that this approach significantly improves on n-gram models on two text corpora.
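As a concrete illustration of the architecture described in the abstract, here is a minimal, untrained forward pass in Python. All sizes, the random initialization, and the omission of the paper's optional direct input-to-output connections are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch (not the authors' code): each word maps to a learned
# feature vector, and a neural network maps the concatenated context
# vectors to a probability distribution over the next word.
import numpy as np

V, m, h, n = 10_000, 60, 50, 4  # vocab size, embedding dim, hidden units, n-gram order (context = n-1 words)
rng = np.random.default_rng(0)

C = rng.normal(scale=0.01, size=(V, m))            # shared word feature matrix
H = rng.normal(scale=0.01, size=(h, (n - 1) * m))  # input-to-hidden weights
d = np.zeros(h)                                    # hidden bias
U = rng.normal(scale=0.01, size=(V, h))            # hidden-to-output weights
b = np.zeros(V)                                    # output bias

def next_word_probs(context_ids):
    """P(w_t | w_{t-n+1}, ..., w_{t-1}) for a context of n-1 word indices."""
    x = C[context_ids].ravel()       # concatenate the context embeddings
    a = np.tanh(d + H @ x)           # hidden layer
    y = b + U @ a                    # unnormalized log-probabilities
    e = np.exp(y - y.max())          # softmax, numerically stabilized
    return e / e.sum()

p = next_word_probs([12, 7, 431])    # three arbitrary word indices
print(p.shape, p.sum())              # (10000,) 1.0
```

In the full model, C, H, d, U, and b are all trained jointly by maximizing the log-likelihood of the training corpus, so the word feature vectors and the probability function are learned together.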
Distributed representation: a technique for mapping a large vocabulary from a high-dimensional space into a low-dimensional space while preserving, as far as possible, the distinctions between different words. Behind it lies the distributional hypothesis: if two words occur in the same contexts, then their representations should be similar. It can be understood as one way of obtaining word representations.
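A toy illustration of this idea follows; the vectors below are made up for the example, not trained. Words used in similar contexts end up with nearby vectors, which we can check with cosine similarity.

```python
# Hypothetical 3-dimensional word vectors illustrating a distributed
# representation: similar words (cat, dog) get similar vectors.
import numpy as np

embedding = {
    "cat": np.array([0.8, 0.1, 0.3]),
    "dog": np.array([0.7, 0.2, 0.3]),
    "car": np.array([0.1, 0.9, 0.6]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embedding["cat"], embedding["dog"]))  # high: similar contexts
print(cosine(embedding["cat"], embedding["car"]))  # lower: different contexts
```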
Introduction
A fundamental problem that makes language modeling and other learning problems difficult is the curse of dimensionality. It is especially acute when one wants to model the joint distribution between many discrete random variables, such as words in a sentence or discrete attributes in a data-mining task. For example, to model the joint distribution of 10 consecutive words in a natural language with a vocabulary V of size 100,000, one could in principle need

$$100{,}000^{10} - 1 = 10^{50} - 1$$

free parameters. When modeling continuous variables, generalization is easier to obtain (for example, with smooth classes of functions such as multilayer neural networks or Gaussian mixture models), because the function to be learned can be expected to have some local smoothness. For discrete spaces, the generalization structure is not as obvious: any change in these discrete variables may have a drastic impact on the value of the function to be estimated, and when each discrete variable can take a large number of values, most observed objects are almost maximally far from each other in Hamming distance.
Starting from the viewpoint of non-parametric density estimation, it is useful to visualize how different learning algorithms generalize. One useful way is to think of how the probability mass that is initially concentrated on the training points (for example, training sentences) is distributed over a larger volume, usually in some form of neighborhood around the training points. In high dimensions, it is crucial to distribute probability mass where it matters rather than uniformly in all directions around each training point. We will show in this paper that the way the approach proposed here generalizes is fundamentally different from the way state-of-the-art statistical language modeling approaches generalize.
A statistical model of language can be represented by the conditional probability of the next word given all the previous ones:

$$\hat{P}(w_1^T) = \prod_{t=1}^{T} \hat{P}(w_t \mid w_1^{t-1}),$$

where $w_t$ is the t-th word and $w_i^j = (w_i, w_{i+1}, \ldots, w_{j-1}, w_j)$ denotes a subsequence. Such statistical language models are useful in many applications involving natural language, such as speech recognition, language translation, and information retrieval. Improvements in statistical language models can therefore have a significant impact on such applications.
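A minimal sketch of this factorization in Python, with a stand-in conditional model (the uniform model below is a placeholder assumption, not a real language model):

```python
# log P(w_1..w_T) = sum_t log P(w_t | w_1..w_{t-1}), computed for any
# conditional model passed in as a function.
import math

def sequence_log_prob(words, cond_prob):
    """Sum of log conditional probabilities over the sequence."""
    return sum(math.log(cond_prob(words[t], words[:t])) for t in range(len(words)))

# Toy stand-in: uniform over a 4-word vocabulary, ignoring the history.
uniform = lambda w, history: 0.25
print(sequence_log_prob(["the", "cat", "sat"], uniform))  # 3 * log(0.25)
```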
When building a statistical model of natural language, exploiting word order greatly reduces the difficulty of the modeling problem, along with the fact that words closer together in the word sequence are statistically more dependent. Thus the n-gram model builds a table of conditional probabilities for the next word, for each combination of the last n-1 words of context:

$$\hat{P}(w_t \mid w_1^{t-1}) \approx \hat{P}(w_t \mid w_{t-n+1}^{t-1}).$$
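As a concrete example of such a table, here is a maximum-likelihood trigram estimate over a toy corpus; the corpus and the relative-frequency estimator are illustrative only:

```python
# P(w_t | w_{t-2}, w_{t-1}) estimated by counting in a training corpus
# (n = 3, i.e. a trigram model).
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p_mle(w, u, v):
    """P(w | u, v) by relative frequency; zero for unseen combinations."""
    return trigrams[(u, v, w)] / bigrams[(u, v)] if bigrams[(u, v)] else 0.0

print(p_mle("sat", "the", "cat"))  # 0.5: 'the cat' is followed by 'sat' once, 'ran' once
```

Note that any trigram never seen in the corpus gets probability exactly zero under this estimator, which motivates the smoothing discussed next.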
Only those combinations of successive words that actually appear in the training corpus, or that occur frequently enough, are considered. But what happens when a new combination of n words appears that was not seen in the training corpus? We do not want to assign zero probability to such cases, because new combinations can occur, and they occur all the more frequently as the context grows larger. A simple answer is to look at the probability predicted using a smaller context size, as in back-off trigram models or smoothed (interpolated) trigram models. So, in such models, how is generalization obtained from word sequences seen in the training corpus to new word sequences? One way to understand this is to think of the generative model corresponding to an interpolated or back-off n-gram model: new word sequences are generated by "gluing" together very short, overlapping pieces of length 1, 2, or up to n words that occur frequently in the training data.
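A minimal sketch of the interpolation idea follows, assuming fixed mixing weights; in practice the weights would be tuned on held-out data, and real systems use more refined schemes such as back-off with discounting:

```python
# Simple linear interpolation of trigram, bigram, and unigram estimates:
# an unseen trigram still receives nonzero probability from the
# lower-order statistics.
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)

def p_interp(w, u, v, l3=0.6, l2=0.3, l1=0.1):
    p3 = trigrams[(u, v, w)] / bigrams[(u, v)] if bigrams[(u, v)] else 0.0
    p2 = bigrams[(v, w)] / unigrams[v] if unigrams[v] else 0.0
    p1 = unigrams[w] / N
    return l3 * p3 + l2 * p2 + l1 * p1

print(p_interp("mat", "cat", "the"))  # unseen trigram, but nonzero probability
```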
Researchers typically use n = 3, i.e., trigrams, and obtain state-of-the-art results this way, but Goodman (2001) combines many tricks that yield substantial improvements. Obviously, there is much more information in the sequence preceding the word to predict than just the identity of the previous two words. There are at least two flaws in this approach that call for improvement. First, it does not take into account contexts farther than one or two words. Second, it does not take into account the "similarity" between words. For example, having seen the sentence "The cat is walking in the bedroom" in the training corpus should help to generalize to the sentence "A dog was running in a room", because "dog" and "cat" (and likewise "the" and "a", "bedroom" and "room") play similar semantic and grammatical roles.