Video Recognition Methods Based on Deep Learning
Deep learning has been extremely hot over the past decade and is arguably the biggest driver of the current AI wave. Internet video has also boomed in recent years: new user-generated-content (UGC) formats such as short video and live streaming have firmly captured users' attention and become another money-maker for the Internet industry. What kind of chemical reaction happens when these two hot fields meet?
Before getting into specific techniques, an illustrative figure shows what a machine can recognize in a video: the red words denote objects, the blue words denote scenes, and the green words denote activities.
Applying artificial intelligence to video is mainly about video understanding, that is, bridging the "semantic gap". This includes:
Structured video analysis: decomposing a video into frames, super-frames, shots, scenes, stories, and so on, and processing and representing it at each of these levels.
Object detection and tracking: for example, vehicle tracking, widely used in the security field.
Person recognition: identifying the people who appear in a video.
Action recognition (activity recognition): identifying the actions performed by people in a video.
Affective semantic analysis: inferring what kind of psychological experience the audience has while watching a video.
Most short videos and live streams carry content information of the form person + scene + action + speech. As shown in Figure 1, the key to understanding this kind of video is how to represent its content effectively. There are many traditional hand-crafted features, among which iDT (Improved Dense Trajectories) is currently the strongest, but they are not discussed here. Deep learning is very good at representing image content, and it also offers ways to represent video content. The following are the mainstream technical approaches of recent years.
1. Recognition based on a single frame
The most direct approach is to cut the video into frames and apply deep learning at the granularity of a single image (one frame). As shown in Figure 2, each video frame is passed through the network to obtain a recognition result. Figure 2 shows a typical CNN: the red rectangles are convolution layers, the green ones are normalization layers, the blue ones are pooling layers, and the yellow ones are fully connected layers. However, a single picture is only a small part of the whole video. When a frame is not discriminative, or is unrelated to the video's topic, the classifier can be confused. Therefore, learning representations over the temporal dimension of the video is the main lever for improving video recognition. Of course, this matters mostly for videos with a lot of motion; a mostly static video can be characterized reasonably well by single images.
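The single-frame baseline above can be sketched as follows. This is a minimal illustration, not the actual network from any paper: the "CNN" here is a stand-in random linear scorer on flattened pixels, and the per-frame softmax outputs are simply averaged over the clip.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def classify_video_by_frames(frames, frame_classifier, num_classes):
    """Single-frame baseline: score each frame independently with an
    image classifier, then average the per-frame class probabilities."""
    probs = np.zeros(num_classes)
    for frame in frames:
        probs += softmax(frame_classifier(frame))
    return probs / len(frames)

# Toy stand-in for a CNN (hypothetical): a fixed random linear layer
# applied to the flattened 32x32x3 frame.
rng = np.random.default_rng(0)
W = rng.standard_normal((32 * 32 * 3, 5))
toy_cnn = lambda img: img.reshape(-1) @ W

frames = rng.standard_normal((10, 32, 32, 3))   # a 10-frame clip
video_probs = classify_video_by_frames(frames, toy_cnn, num_classes=5)
print(video_probs.shape)  # -> (5,)
```

Averaging the per-frame distributions is exactly why an off-topic or non-discriminative frame dilutes the prediction, as noted above.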
2. Recognition based on a temporally extended CNN
The general idea is to introduce temporal convolutions into the CNN framework to capture local motion information and thereby improve overall recognition performance. Figure 3 shows the network structure, which has three temporal layers. The first layer convolves an image sequence of 10 frames (about 1/3 second) shaped M x N x 3 x T, where M x N is the image resolution, 3 is the number of color channels, and T = 4 is the number of frames involved in each convolution, producing 4 responses along the time axis. The second and third layers apply temporal convolutions with T = 2, so that by the third layer each response covers the spatio-temporal information of all 10 input frames. The same layer shares its parameters across different time steps.
Its overall accuracy is about 2% higher than the single-frame method, especially on sports videos such as wrestling and pole climbing. In implementation, this architecture can also incorporate multi-resolution processing, which improves speed.
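The shrinking temporal resolution described above (10 frames, then 4 responses, then 2, then 1) can be sketched with a toy temporal convolution. This is an assumption-laden illustration: the "convolution" is a plain moving average rather than a learned kernel, and the strides (2, 2, 1) are chosen here only so that the output counts match the 10 → 4 → 2 → 1 progression in the text.

```python
import numpy as np

def temporal_conv(clip, width, stride=1):
    """Toy temporal convolution: average `width` consecutive frames
    (step `stride`) along the last axis of a clip shaped (M, N, 3, T).
    A learned layer would use trained weights instead of a mean."""
    T = clip.shape[-1]
    starts = range(0, T - width + 1, stride)
    return np.stack([clip[..., s:s + width].mean(axis=-1) for s in starts],
                    axis=-1)

rng = np.random.default_rng(1)
clip = rng.standard_normal((8, 8, 3, 10))        # 10-frame input clip
h1 = temporal_conv(clip, width=4, stride=2)      # layer 1: 4 responses in time
h2 = temporal_conv(h1, width=2, stride=2)        # layer 2: 2 responses
h3 = temporal_conv(h2, width=2)                  # layer 3: 1 response covering all 10 frames
print(h1.shape[-1], h2.shape[-1], h3.shape[-1])  # -> 4 2 1
```

Each unit in `h3` has a temporal receptive field spanning the whole 10-frame clip, which is the point of stacking the temporal layers.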
3. Two-stream CNN recognition
This method actually uses two independent neural networks and averages the outputs of the two models at the end. The upper one is an ordinary single-frame CNN; in the paper, this CNN is pre-trained on ImageNet and then fine-tuned in its last layer on video data. The lower CNN takes a stack of optical flow fields from several consecutive frames as input. In addition, the method uses multi-task learning to overcome the shortage of training data: the last layer of the CNN is connected to multiple softmax layers, one per dataset, so that multi-task learning can be performed across several datasets. The network structure is shown in Figure 4.
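The late-fusion step of the two-stream approach can be sketched as below. This is a simplified stand-in, not the paper's networks: both "streams" are hypothetical random linear scorers, and the flow stack shape (2 channels for dx/dy over 10 frames) is an assumption for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def two_stream_predict(rgb_frame, flow_stack, spatial_net, temporal_net):
    """Two-stream late fusion: run the spatial net on one RGB frame and
    the temporal net on stacked optical-flow fields, then average the
    two softmax distributions."""
    p_spatial = softmax(spatial_net(rgb_frame))
    p_temporal = softmax(temporal_net(flow_stack))
    return (p_spatial + p_temporal) / 2

rng = np.random.default_rng(2)
# Hypothetical stand-ins for the two CNNs: fixed random linear scorers.
Ws = rng.standard_normal((32 * 32 * 3, 5))
Wt = rng.standard_normal((32 * 32 * 2 * 10, 5))  # 10 flow fields, 2 channels
spatial_net = lambda x: x.reshape(-1) @ Ws
temporal_net = lambda x: x.reshape(-1) @ Wt

rgb = rng.standard_normal((32, 32, 3))
flow = rng.standard_normal((32, 32, 2, 10))
fused = two_stream_predict(rgb, flow, spatial_net, temporal_net)
print(fused.shape)  # -> (5,)
```

Averaging the two distributions keeps the result a valid probability vector, so either stream can dominate where the other is uncertain.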
4. Recognition based on LSTM
The basic idea is to use an LSTM to fuse, along the time axis, the activations of the last convolution layer of a per-frame CNN. It does not fuse the features after the CNN's fully connected layers, because those high-level features have been pooled and have lost the spatial information needed along the time axis. Compared with method 2, on the one hand it can fuse CNN features over longer spans: it is not limited to a fixed number of input frames and can represent longer videos. On the other hand, method 2 does not model the order of frames within the network, whereas the LSTM, by introducing memory units, can effectively represent the frame sequence. The network structure is shown in Figure 5.
In Figure 5, red is the convolutional network, grey are the LSTM units, and yellow is the softmax classifier. The LSTM takes the last-layer convolution features of the CNN for each consecutive frame as input, advancing from left to right in time, with 5 LSTM layers stacked from bottom to top; the softmax layer at the top gives a classification result at every time point. As with the previous networks, the parameters are shared across time steps.
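The per-time-step recurrence can be sketched with a minimal single-layer LSTM in numpy. This is an illustrative sketch under stated assumptions, not the 5-layer network of Figure 5: the per-frame CNN features are random stand-ins, there is one layer instead of five, and the softmax classifier on top is omitted. Gate weights are packed as [input, forget, cell, output] blocks.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_over_frames(frame_feats, Wx, Wh, b):
    """Run a single-layer LSTM over per-frame features (T, D).
    Returns the hidden state at every time step, shape (T, H);
    the same weights are reused at every step (shared in time)."""
    T, D = frame_feats.shape
    H = Wh.shape[0]
    h = np.zeros(H)
    c = np.zeros(H)
    hs = []
    for t in range(T):
        z = frame_feats[t] @ Wx + h @ Wh + b   # all four gates at once, (4H,)
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)             # memory cell update
        h = o * np.tanh(c)                     # hidden state for this frame
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(3)
T, D, H = 10, 16, 8                            # 10 frames, toy feature sizes
Wx = rng.standard_normal((D, 4 * H)) * 0.1
Wh = rng.standard_normal((H, 4 * H)) * 0.1
b = np.zeros(4 * H)
feats = rng.standard_normal((T, D))            # stand-in for per-frame CNN features
hidden = lstm_over_frames(feats, Wx, Wh, b)
print(hidden.shape)  # -> (10, 8)
```

In the full model, each row of `hidden` would feed a softmax layer, giving the per-time-point classifications described above; the memory cell `c` is what lets the network carry frame-order information forward.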