Video detection based on AI deep learning
The primary topic for artificial intelligence in video is video understanding, which tries to close the "semantic gap" and includes:
Video structural analysis: the video is decomposed into frames, super-frames, shots, scenes, and stories, and then processed and represented at each of these levels.
Object detection and tracking: for example vehicle tracking, mostly used in the security field.
Person recognition: identify the people who appear in the video.
Action recognition (activity recognition): identify the actions performed by the people in the video.
Emotional semantic analysis: what kind of emotional experience a viewer has when watching a given video.
Most short videos and live videos carry content information of the form person + scene + action + speech. As shown in Figure 1, how to represent this content with informative features is the key to video understanding.
There are many traditional hand-crafted features, among which IDT (Improved Dense Trajectories) currently performs best; I won't cover it here. Deep learning is very good at representing image content, and corresponding methods exist for representing video content. The following describes several mainstream techniques from recent years.
Single-frame recognition method
The most straightforward method is to split the video into frames and then apply deep learning at the granularity of a single image. As shown in Figure 2, one frame of the video is passed through the network to obtain a prediction. Figure 2 shows a typical CNN: the red rectangles are convolution layers, the green are normalization layers, the blue are pooling layers, and the yellow are fully connected layers. But a single picture is only a small part of the whole video; if that frame happens not to be discriminative, or is unrelated to the video's topic, it will confuse the classifier. Learning a representation of the video's temporal dimension is therefore the key to improving video recognition. Of course, this matters mainly for motion-heavy videos; for largely static videos, recognition can only rely on image features anyway.
[Figure 2]
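As a minimal sketch of this single-frame approach (my own illustration, not the article's code): sample one frame, run it through an ImageNet-pretrained 2D CNN, and treat the image-level prediction as the video-level prediction. ResNet-18 is an assumed stand-in for the conv/norm/pool/fully-connected stack described above; it requires a recent torchvision.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Hypothetical backbone: any 2D image CNN works here.
cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
cnn.eval()

# Standard ImageNet preprocessing for a single frame.
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

def classify_frame(frame_pil):
    """Classify one video frame (a PIL image) with the 2D CNN."""
    x = preprocess(frame_pil).unsqueeze(0)  # shape: (1, 3, 224, 224)
    with torch.no_grad():
        logits = cnn(x)
    return logits.softmax(dim=1)            # per-class probabilities
```

Averaging such per-frame predictions over several sampled frames is a common way to turn this image-level classifier into a crude video-level one.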
Recognition method based on an extended CNN
The overall idea is to find a way to express part of the motion information along the time axis within the CNN structure, thereby improving overall recognition performance. Figure 3 shows such a network structure with three layers. The first layer performs an M×N×3×T convolution on a sequence of 10 frames (about one third of a second), where M×N is the image resolution, 3 is the number of color channels, and T = 4 is the number of frames entering each computation, yielding 4 positions along the time axis. The second and third layers then perform temporal convolutions with T = 2, so that by the third layer the temporal and spatial information of all 10 frames has been combined. Network parameters in the same layer are shared across time positions.
Its overall accuracy is about 2% higher than the single-frame method, and on motion-rich videos such as wrestling or climbing the improvement is considerably larger, which shows that the motion information captured in the features does contribute to recognition. In implementation, this network architecture can also incorporate multi-resolution processing to push performance further.
[Figure 3]
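The following is a minimal PyTorch sketch of this temporal-convolution idea, with assumed channel counts and strides: the first layer convolves T = 4 neighbouring frames jointly (with temporal stride 2, giving 4 positions on the time axis for a 10-frame clip), and two further layers with temporal extent T = 2 widen the temporal receptive field until, across the third layer's outputs, all 10 frames are covered. Weight sharing across time positions falls out of the convolution itself.

```python
import torch
import torch.nn as nn

class TemporalCNN(nn.Module):
    def __init__(self, num_classes=101):
        super().__init__()
        # Layer 1: kernel (T=4, 7, 7) over the (3-channel, 10-frame) clip,
        # temporal stride 2 -> 4 positions on the time axis.
        self.conv1 = nn.Conv3d(3, 32, kernel_size=(4, 7, 7),
                               stride=(2, 2, 2), padding=(0, 3, 3))
        # Layers 2 and 3: temporal extent T=2 each.
        self.conv2 = nn.Conv3d(32, 64, kernel_size=(2, 3, 3),
                               padding=(0, 1, 1))
        self.conv3 = nn.Conv3d(64, 128, kernel_size=(2, 3, 3),
                               padding=(0, 1, 1))
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(128, num_classes)

    def forward(self, clip):
        # clip: (batch, 3, 10, H, W) -- 10 frames, 3 color channels
        x = torch.relu(self.conv1(clip))  # temporal size 10 -> 4
        x = torch.relu(self.conv2(x))     # 4 -> 3
        x = torch.relu(self.conv3(x))     # 3 -> 2
        x = self.pool(x).flatten(1)       # fuse remaining time + space
        return self.fc(x)
```

This is an illustration of the technique, not a reproduction of the paper's exact layer sizes.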
Two-stream CNN recognition method
This is actually two independent neural networks whose outputs are averaged at the end. The upper stream is an ordinary single-frame CNN; the article notes that this CNN is pre-trained on ImageNet and its final layer is then fine-tuned on video data. The lower CNN takes several consecutive frames of stacked optical flow as its input. In addition, it uses multi-task learning to overcome the shortage of training data: the final layer of the CNN is connected to multiple softmax layers, each corresponding to a different dataset, so that multi-task learning can be performed across several datasets.
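A minimal sketch of the two-stream idea, with assumed backbones and layer sizes (ResNet-18 stands in for both streams): a spatial CNN on a single RGB frame, a temporal CNN on a stack of optical-flow fields (two channels, x and y, per flow frame), and late fusion by averaging the two softmax outputs. The multi-head multi-task variant described above is omitted for brevity.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TwoStream(nn.Module):
    def __init__(self, num_classes=101, flow_frames=10):
        super().__init__()
        # Spatial stream: ImageNet-pretrained 2D CNN on one RGB frame.
        self.spatial = models.resnet18(
            weights=models.ResNet18_Weights.DEFAULT)
        self.spatial.fc = nn.Linear(self.spatial.fc.in_features,
                                    num_classes)
        # Temporal stream: same backbone shape, but the first conv
        # takes 2 * flow_frames channels (x and y flow per frame) and
        # is trained from scratch.
        self.temporal = models.resnet18(weights=None)
        self.temporal.conv1 = nn.Conv2d(2 * flow_frames, 64,
                                        kernel_size=7, stride=2,
                                        padding=3, bias=False)
        self.temporal.fc = nn.Linear(self.temporal.fc.in_features,
                                     num_classes)

    def forward(self, rgb_frame, flow_stack):
        # rgb_frame: (batch, 3, H, W); flow_stack: (batch, 2*L, H, W)
        p_spatial = self.spatial(rgb_frame).softmax(dim=1)
        p_temporal = self.temporal(flow_stack).softmax(dim=1)
        return (p_spatial + p_temporal) / 2  # average the two streams
```

Averaging the class probabilities, as here, is the simplest fusion; a trained fusion layer over the two streams is an equally common choice.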