Natural scene text detection in deep learning
The difference between text detection and generic object detection: a text line is a sequence (of characters, parts of characters, or multi-character fragments), rather than an isolated object as in generic object detection. This is both an advantage and a difficulty. The advantage is that different characters on the same text line provide context for one another, so the line can be modeled with a sequence method such as an RNN. The difficulty is detecting a complete text line: the characters on a line may differ greatly from each other and be far apart, so detecting the line as a whole is harder than detecting a single object. For this reason, the author argues that the vertical position of the text (the upper and lower borders of the text bounding box) is easier to predict than the horizontal position (the left and right borders).
The top-down text detection approach (detect the text region first, then find the text lines) is better than the traditional bottom-up approach (detect characters first, then string them into text lines). The drawbacks of the bottom-up approach (stated more clearly in the authors' other paper) are, in short: it ignores context, it is not robust enough, and it requires too many sub-modules, making the system overly complicated; errors accumulate from stage to stage, which limits performance.
Seamlessly combining an RNN with a CNN improves detection accuracy. The CNN extracts deep features, and the RNN classifies the sequence of features (two classes: text / non-text). Combining the two seamlessly gives better detection performance.
One of the difficulties of text detection is that the length of a text line varies dramatically. Methods built directly on generic object detection frameworks such as Faster R-CNN therefore face a problem: how to generate good text proposals. This is genuinely hard to handle. In this article the author takes a different approach: detect small text segments of fixed width, then link these small segments together to obtain the text line.
Method summary
The basic pipeline, shown in Fig. 1, consists of six steps (a small code sketch of the prediction head follows the list):
First, obtain the feature map (W*H*C) from the first five conv stages of VGG16 (up to conv5).
Second, take the 3*3*C window features at each location of the conv5 feature map. These features are used to predict the category and position information of the k anchors at that location (the anchor definition is similar to that of Faster R-CNN).
Third, feed the 3*3*C features of all windows in each row (W*3*3*C) into an RNN (BLSTM) to obtain an output of W*256.
Fourth, feed the W*256 output of the RNN into a 512-dimensional fc layer.
Fifth, the fc features are fed into three classification/regression output layers. The 2k scores give the category of the k anchors (text or non-text). The 2k vertical coordinates and the k side-refinements encode the position of the k anchors: the 2k vertical coordinates give each bounding box's height and center y-coordinate (which together determine the upper and lower borders), and the k side-refinements give the horizontal offset of each bounding box. Note that only three regression parameters are needed per anchor because the width of every anchor is implicitly fixed at 16 and never changes (the stride of conv5 in VGG16 is 16). The regressed boxes are the slender red rectangles in Fig. 1; their widths are all the same.
Sixth, a simple text line construction algorithm merges the text proposals from the previous steps into text lines (the slender rectangles in Fig. 1(b)).
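Below is a minimal PyTorch sketch of steps two through five, using the layer sizes described above. It is illustrative only, not the authors' released code; in particular, the 3*3 sliding window is implemented as a 3*3 convolution (a common simplification), and the reshaping of the outputs back onto the feature-map grid is omitted.

import torch
import torch.nn as nn

class CTPNHead(nn.Module):
    def __init__(self, c_in=512, k=10):
        super().__init__()
        # step 2: 3x3 window over conv5 features, realized as a 3x3 convolution
        self.window = nn.Conv2d(c_in, 512, kernel_size=3, padding=1)
        # step 3: bidirectional LSTM over each row, 128 hidden units per direction -> 256-D output
        self.blstm = nn.LSTM(512, 128, bidirectional=True, batch_first=True)
        # step 4: 512-D fully connected layer
        self.fc = nn.Linear(256, 512)
        # step 5: per-location outputs for k anchors
        self.cls = nn.Linear(512, 2 * k)    # 2k text / non-text scores
        self.vert = nn.Linear(512, 2 * k)   # 2k vertical coordinates (center-y and height)
        self.side = nn.Linear(512, k)       # k side-refinement offsets

    def forward(self, conv5):               # conv5: (N, C, H, W) feature map from VGG16
        x = torch.relu(self.window(conv5))  # (N, 512, H, W)
        n, c, h, w = x.shape
        x = x.permute(0, 2, 3, 1).reshape(n * h, w, c)  # each feature-map row becomes a length-W sequence
        x, _ = self.blstm(x)                # (N*H, W, 256)
        x = torch.relu(self.fc(x))          # (N*H, W, 512)
        return self.cls(x), self.vert(x), self.side(x)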
Fig. 1: (a) Architecture of the Connectionist Text Proposal Network (CTPN). We densely slide a 3×3 spatial window through the last convolutional maps (conv5) of the VGG16 model [27]. The sequential windows in each row are recurrently connected by a Bi-directional LSTM (BLSTM) [7], where the convolutional feature (3×3×C) of each window is used as input of the 256D BLSTM (including two 128D LSTMs). The RNN layer is connected to a 512D fully-connected layer, followed by the output layer, which jointly predicts text/non-text scores, y-axis coordinates and side-refinement offsets of k anchors. (b) The CTPN outputs sequential fixed-width fine-scale text proposals. The color of each box indicates the text/non-text score. Only boxes with positive scores are shown.
Method details
Detecting Text in Fine-scale Proposals
Anchor size and aspect ratio settings: the width is fixed at 16, k = 10, and the heights range from 11 to 273 pixels (dividing by 0.7 each step).
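For concreteness, here is one way to reproduce the ten anchor heights implied by this rule (starting from 273 and repeatedly multiplying by 0.7); the exact rounding in the released code may differ slightly.

# hypothetical reconstruction of the k = 10 anchor heights (the width is always 16)
heights = sorted(round(273 * 0.7 ** i) for i in range(10))
print(heights)  # [11, 16, 22, 32, 46, 66, 94, 134, 191, 273]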
The regression targets for the bounding box height and the y-coordinate of its center are given below; quantities marked with * are ground truth and quantities marked with a refer to the anchor.
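The formulas themselves did not survive extraction; as defined in the CTPN paper, the vertical regression targets are:

v_c = (c_y - c_y^a) / h^a,        v_h = log(h / h^a)
v_c^* = (c_y^* - c_y^a) / h^a,    v_h^* = log(h^* / h^a)

where c_y^a and h^a are the center y-coordinate and height of the anchor, c_y and h are the predicted center y-coordinate and height, and c_y^* and h^* are the ground-truth values.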
Score threshold: 0.7 (followed by NMS).
An anchor whose IoU with a ground-truth box is greater than 0.7 is taken as a positive sample. In addition, for each ground-truth box, the anchor with the highest IoU is also defined as positive, regardless of whether that IoU reaches 0.7; this helps detect small text.
An anchor whose IoU with the ground truth is less than 0.5 is defined as a negative sample.
Only proposals with a score greater than 0.7 are kept.
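A minimal sketch of the positive/negative labeling rule above, assuming an IoU matrix between anchors and ground-truth boxes has already been computed (illustrative only; the function name and the ignore label are assumptions):

import numpy as np

def label_anchors(ious, pos_thresh=0.7, neg_thresh=0.5):
    # ious: (num_anchors, num_gt) IoU matrix; returns 1 = positive, 0 = negative, -1 = ignored
    labels = np.full(ious.shape[0], -1, dtype=int)
    max_iou = ious.max(axis=1)
    labels[max_iou < neg_thresh] = 0    # low overlap with every ground-truth box: negative
    labels[max_iou > pos_thresh] = 1    # high overlap: positive
    labels[ious.argmax(axis=0)] = 1     # best anchor for each ground-truth box: positive regardless of IoU
    return labels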
Comparison of detection results between an ordinary RPN and the fine-scale proposal method of this article.
Recurrent Connectionist Text Proposals
RNN type: BLSTM (bi-directional LSTM), with 128 hidden units in each LSTM direction.
RNN input: the 3*3*C features of each sliding window (flattened into a vector); the features of the windows in the same row form a sequence.
RNN output: a 256-dimensional feature for each window.
Comparison of results with and without the RNN; CTPN (Connectionist Text Proposal Network) is the method of this article.
Side-refinement
Text line construction algorithm (merging the slender proposals into text lines)
Main idea: every two similar proposals form a pair; pairs are then merged repeatedly until no further merging is possible (no pairs share a common element). A sketch of this merging follows the pairing conditions below.
Conditions for two proposals Bi and Bj to form a pair:
Bj->Bi and Bi->Bj (Bj->Bi means that Bj is Bi's best neighbor).
Bj->Bi condition 1: Bj is the neighbor of Bi with the smallest horizontal distance, and that distance is less than 50 pixels.
Bj->Bi condition 2: the vertical overlap between Bj and Bi is greater than 0.7.
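A sketch of the pairing and merging logic just described. The box format, the horizontal-distance measure, and the exact definition of vertical overlap are assumptions made for illustration; the authors' implementation may differ in detail.

def vertical_overlap(a, b):
    # a, b: proposals as (x1, y1, x2, y2); overlap ratio relative to the smaller height (assumed definition)
    inter = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return inter / min(a[3] - a[1], b[3] - b[1])

def best_neighbor(i, boxes):
    # Bj -> Bi: the horizontally nearest proposal within 50 px whose vertical overlap with Bi exceeds 0.7
    cands = [(abs(boxes[j][0] - boxes[i][0]), j) for j in range(len(boxes))
             if j != i and abs(boxes[j][0] - boxes[i][0]) < 50
             and vertical_overlap(boxes[i], boxes[j]) > 0.7]
    return min(cands)[1] if cands else None

def merge_into_lines(boxes):
    # join proposals that are mutual best neighbors (union-find), then wrap each group
    # in its enclosing rectangle to form a text line
    parent = list(range(len(boxes)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(len(boxes)):
        j = best_neighbor(i, boxes)
        if j is not None and best_neighbor(j, boxes) == i:
            parent[find(i)] = find(j)
    groups = {}
    for i, b in enumerate(boxes):
        groups.setdefault(find(i), []).append(b)
    return [(min(b[0] for b in g), min(b[1] for b in g),
             max(b[2] for b in g), max(b[3] for b in g)) for g in groups.values()]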
Because the proposal width is fixed and the horizontal position is not regressed, the horizontal boundaries of the predicted text line can be inaccurate, so the author introduces side-refinement to regress the horizontal position. Here x_side is the predicted x-coordinate of the horizontal side (left or right) nearest to the current anchor, x_side^* is the ground-truth (GT) side coordinate on the x-axis, pre-computed from the GT bounding box and the anchor location, c_x^a is the x-axis center of the anchor, and w^a is the anchor width, which is fixed at w^a = 16.
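The offset that side-refinement regresses, as given in the CTPN paper, is:

o = (x_side - c_x^a) / w^a,        o^* = (x_side^* - c_x^a) / w^a,        w^a = 16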
Effect of using side-refinement.
Experimental results
Training
For each training image, 128 samples are taken in total: 64 positive and 64 negative. If there are not enough positive samples, the mini-batch is padded with negative samples. This follows the same sampling strategy as Fast R-CNN.
Training images are resized so that the short side is 600 pixels.
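A small sketch of this sampling rule; the function name and arguments are assumptions, and it reuses the labels produced by the labeling sketch above:

import numpy as np

def sample_minibatch(labels, batch_size=128, rng=np.random):
    # labels: per-anchor labels (1 = positive, 0 = negative); returns indices of the sampled anchors
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = min(len(pos), batch_size // 2)                            # at most 64 positives
    keep_pos = rng.choice(pos, n_pos, replace=False)
    keep_neg = rng.choice(neg, batch_size - n_pos, replace=False)     # pad the rest with negatives
    return np.concatenate([keep_pos, keep_neg])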
Preparing the training data
As mentioned above, this network predicts fixed-width text proposals, so the ground truth must be labeled in the same form. However, common datasets provide only text-line or word-level annotations, so those labels need to be converted into a series of fixed-width boxes. The code for this is in the prepare_training_data folder.
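A hedged sketch of what this conversion amounts to: each text-line (or word) ground-truth box is cut into 16-pixel-wide boxes aligned to the conv5 stride. The grid alignment and boundary handling here are simplifications; the scripts in prepare_training_data are authoritative.

def split_gt_box(x1, y1, x2, y2, stride=16):
    # cut a ground-truth box into fixed-width segments snapped to the 16-px grid
    segments = []
    left = int(x1) // stride * stride
    while left < x2:
        segments.append((left, y1, left + stride, y2))
        left += stride
    return segments

# e.g. split_gt_box(20, 5, 100, 40) -> six boxes starting at x = 16, 32, 48, 64, 80, 96 (each 16 px wide)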
The whole repo is modified from rbg's (Ross Girshick's) Faster R-CNN code, so it follows that code's input format: the data must be converted to VOC-style annotations. This conversion code is also in the prepare_training_data folder.
Runtime: 0.14 s with a GPU.
Detection results on the ICDAR 2011, ICDAR 2013, and ICDAR 2015 benchmarks.
Summary and takeaways
The biggest highlight of this approach is introducing an RNN into the detection problem (previously RNNs were mainly used for recognition). For text detection, a CNN first extracts deep features; fixed-width anchors then detect text proposals (pieces of a text line); the features of the anchors in the same row are strung into a sequence and fed into the RNN; finally, fully connected layers perform classification and regression, and the resulting text proposals are merged into text lines. This seamless combination of RNN and CNN improves detection accuracy.