1 CNN Convolution Foundation
What should the filter look like? If the input image shape is (64, 64, 3), then the last dimension of the filter must also be 3. With a filter size of 5, a filter's shape is (5, 5, 3).
The "convolution" in a CNN is actually cross-correlation, but in the CNN literature everyone simply calls it convolution; since the filters are learned, the effect is indeed the same as true convolution.
A (5, 5, 3) filter is applied to a (64, 64, 3) input. With padding=0 and stride=1, correlating each channel separately first gives a (60, 60, 3) intermediate result; the channel values are then summed over the last axis (like `np.sum(intermediate, axis=-1)`), so one filter produces a (60, 60, 1) result. With F filters the output is (60, 60, F).
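A minimal NumPy sketch of this computation for a single filter (variable names are illustrative; real frameworks vectorize this):

```python
import numpy as np

H, W, C = 64, 64, 3                       # input height, width, channels
f = 5                                     # filter size
x = np.random.randn(H, W, C)              # input image
filt = np.random.randn(f, f, C)           # one (5, 5, 3) filter

out = np.zeros((H - f + 1, W - f + 1))    # (60, 60) for padding=0, stride=1
for i in range(H - f + 1):
    for j in range(W - f + 1):
        window = x[i:i + f, j:j + f, :]                   # (5, 5, 3) patch under the filter
        per_channel = np.sum(window * filt, axis=(0, 1))  # (3,) per-channel sums
        out[i, j] = np.sum(per_channel, axis=-1)          # collapse the channels

print(out.shape)                          # (60, 60); stacking F such filters gives (60, 60, F)
```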
Why use convolution? It is mainly related to the nature of image data: images are semi-structured data that show similarity in local regions. There are two main benefits: 1. parameter sharing; 2. sparse connections. The number of weights in a layer depends only on the filters, not on the input size. For example, with filter shape (5, 5, 16) and 64 filters, the parameter count is (5 * 5 * 16 + 1) * 64 = (400 + 1) * 64.
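A quick check of that parameter count (a one-off sketch; the `+ 1` is the bias term of each filter):

```python
f, c_prev, n_filters = 5, 16, 64
params = (f * f * c_prev + 1) * n_filters   # (400 + 1) * 64
print(params)                               # 25664 -- independent of the input's H and W
```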
In general, as the number of layers increases, H and W continue to decrease while C increases. (C can be considered as the number of feature detectors)
2 Several classic CNN implementations
I copied the slides directly; compared with ResNet and Inception, I don't think these implementations introduce many new concepts.
LeNet-5 solved the handwritten digit recognition problem of its time. When I read the paper I also clipped a figure, so the figure given by Andrew Ng can be compared with the one in the paper; they look almost the same. It has about 60k parameters in total.
AlexNet is the CNN from Geoff Hinton's team in the 2012 ImageNet competition. There is no essential difference from LeNet-5; it is just bigger, and it adds ReLU, multiple GPUs (because GPU memory was very small at the time), and local response normalization (LRN). It has about 60M parameters, roughly 1000 times LeNet-5.
AlexNet uses many ad-hoc hyperparameters and structures; VGG instead tries a simple, uniform configuration with a larger network to get better results. VGG uses (3, 3) filters everywhere plus (2, 2) max-pooling, which makes the structure much more regular. VGG-16 has 138M parameters, and VGG-19 has even more.
3 Residual Network
A residual network superimposes the original input back onto the output a few layers later; this superposition is the skip connection. A group of layers with such a skip connection forms a residual block, and stacking these residual blocks gives a residual network.
ResNet can train deeper neural networks better. A very important reason is that a residual block can easily learn the identity mapping, which mitigates the vanishing-gradient problem. The following slide shows why the identity mapping is easy to learn: a[l+2] = g(z[l+2] + a[l]), so if w, b ~= 0 and g is ReLU, then a[l+2] ~= g(a[l]) = a[l].
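A rough NumPy sketch of a fully-connected residual block showing why near-zero weights collapse to the identity (purely illustrative; a real ResNet block uses convolutions and batch norm):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def res_block(a_l, W1, b1, W2, b2):
    """Map a[l] to a[l+2] with a skip connection from a[l]."""
    a1 = relu(W1 @ a_l + b1)
    z2 = W2 @ a1 + b2
    return relu(z2 + a_l)            # the skip connection adds a[l] before the activation

n = 4
a_l = relu(np.random.randn(n))       # a[l] is itself a ReLU output, so a[l] >= 0
W = np.zeros((n, n))                 # pretend regularization pushed w, b to ~0
b = np.zeros(n)

a_l2 = res_block(a_l, W, b, W, b)
print(np.allclose(a_l2, a_l))        # True: the block has learned the identity mapping
```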
4 Inception Network
The Inception network comes from Google. The starting point is simple: since it is not obvious which filter size is most appropriate, just try 1, 3, and 5 in parallel and concatenate the results as the output.
It also uses 1x1 filters. A 1x1 filter looks meaningless at first: after applying it, H and W are unchanged and only C changes. But it has a special purpose: it reduces the amount of convolution computation. Look at the two figures below:
- The first computation costs (5 * 5 * 192) * (28 * 28 * 32) ≈ 120M multiplications
- The second is obtained indirectly through a 1x1 conv
- Stage 1: (1 * 1 * 192) * (28 * 28 * 16) ≈ 2.4M
- Stage 2: (5 * 5 * 16) * (28 * 28 * 32) ≈ 10M
- About 12.4M in total
- I worked this out on paper and the numbers hold, provided the data first goes through the 1x1 conv bottleneck
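The same arithmetic as a tiny script (shapes follow the slide: a 28x28x192 input, 5x5 convs producing 28x28x32, a 16-channel bottleneck):

```python
# direct 5x5 conv: each output value costs 5*5*192 multiplications
direct = (5 * 5 * 192) * (28 * 28 * 32)      # ~120.4M

# bottleneck: 1x1 conv down to 16 channels, then the 5x5 conv
stage1 = (1 * 1 * 192) * (28 * 28 * 16)      # ~2.4M
stage2 = (5 * 5 * 16) * (28 * 28 * 32)       # ~10.0M

print(direct / 1e6, (stage1 + stage2) / 1e6) # 120.42 vs 12.44
```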
Since the 1x1 conv can help the 3x3 and 5x5 convs in this way, the final inception block looks like the following:
These blocks are then organized into the following inception network. Andrew labels each inception block on the slide. This inception network is also known as GoogLeNet; the name is a tribute to LeNet.
5 Practical Advice
Data Augmentation can be used moderately on images, including the following methods:
- Mirroring: horizontal flipping
- Cropping: random crops
- Color shifting: offsetting the RGB channels
These methods are helpful for CNN training; some other methods, such as rotation, are used much less in practice.
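A minimal NumPy sketch of the three augmentations on an (H, W, 3) image array (illustrative only; real pipelines usually do this on the fly in the data loader):

```python
import numpy as np

def mirror(img):
    return img[:, ::-1, :]                       # flip left-right

def random_crop(img, size):
    h, w, _ = img.shape
    top = np.random.randint(0, h - size + 1)
    left = np.random.randint(0, w - size + 1)
    return img[top:top + size, left:left + size, :]

def color_shift(img, scale=20.0):
    offset = np.random.randn(3) * scale          # one random offset per RGB channel
    return np.clip(img + offset, 0, 255)

img = np.random.randint(0, 256, (64, 64, 3)).astype(np.float32)
augmented = color_shift(random_crop(mirror(img), 56))
print(augmented.shape)                           # (56, 56, 3)
```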
In competitions you would use ensembles, and at prediction time you can do multi-crop on the test input: cut out multiple crops, predict on each, and combine the results. However, these techniques are rarely used in production because they hurt runtime performance.
6 Object Detection
A naive/simple object detection method is to slide windows of different sizes over the image and feed the crop under each window into a CNN to decide whether that small image contains an object. Call this method sliding window.
One disadvantage of this approach is that the sliding window sizes cannot be adjusted dynamically; they must be defined in advance. In addition, the naive implementation is computationally expensive, and there is a way to reduce the computation: combine convolution with the sliding window. The figure below illustrates the idea: an FC layer can be represented as a convolutional layer, so the whole network becomes convolutional, and each position of the output then corresponds to the result of one sliding window.
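A small NumPy check of the "FC layer as conv layer" idea: an FC layer applied to a flattened 5x5x16 window produces the same numbers as a conv layer whose filters are the reshaped FC weights (the shapes here are made up for illustration):

```python
import numpy as np

f, c, n_units = 5, 16, 4
window = np.random.randn(f, f, c)                # one sliding-window position
W_fc = np.random.randn(n_units, f * f * c)       # FC weights over the flattened window

# FC view: flatten the window and multiply
fc_out = W_fc @ window.reshape(-1)

# Conv view: each FC unit becomes one (5, 5, 16) filter covering the whole window
filters = W_fc.reshape(n_units, f, f, c)
conv_out = np.array([np.sum(window * filters[k]) for k in range(n_units)])

print(np.allclose(fc_out, conv_out))             # True
```

Because the filters slide, running this conv over a larger image evaluates every sliding-window position in one forward pass instead of re-running the network per crop.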
The state-of-the-art technique for object detection should be YOLO (You Only Look Once). The implementation is divided into the following steps (first assume each grid cell contains at most one object):
- Divide the input image into a 9x9 or 19x19 grid
- A prediction is made for each grid cell; the output includes (P, bx, by, bh, bw). Each such five-tuple is one box, and there can be many boxes
- P is the probability that an object is present
- bx, by is the midpoint of the object, and bh, bw are its height and width
- These values are expressed as fractions of the grid cell; e.g. bx, by = 0.5, 0.5 means the midpoint is at the center of the cell
- bh, bw can exceed 1, meaning the object spans multiple grid cells
- Run non-max suppression on these boxes; the algorithm is simple (a sketch follows after the IoU note below)
- Remove boxes whose P is below a threshold (e.g. 0.4); these boxes are unlikely to contain an object
- From the remaining boxes, select the one with the highest P (call it A) and decide that A contains an object
- Among the remaining boxes, discard any whose overlap with A exceeds a threshold (e.g. 0.5)
- Repeat from the "select the highest-P box" step until no boxes remain
The overlap ratio can be computed with IoU (intersection over union), i.e. overlap area / (area A + area B - overlap area). If the IoU is high, the two boxes are considered to mark the same object.
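A rough sketch of IoU plus the non-max suppression loop above, with boxes as (x1, y1, x2, y2) corners and a score P per box (the 0.4 and 0.5 thresholds follow the text; everything else is illustrative):

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, p_thresh=0.4, iou_thresh=0.5):
    # step 1: drop boxes whose probability is below the threshold
    keep = [(b, s) for b, s in zip(boxes, scores) if s >= p_thresh]
    picked = []
    while keep:
        # step 2: pick the highest-scoring remaining box
        keep.sort(key=lambda bs: bs[1], reverse=True)
        best, best_score = keep.pop(0)
        picked.append((best, best_score))
        # step 3: discard boxes that overlap it too much, then repeat
        keep = [(b, s) for b, s in keep if iou(best, b) <= iou_thresh]
    return picked

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))   # keeps the first and the third box
```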
If multiple objects need to be detected, more than one object may fall in the same grid cell, and anchor boxes are needed to help. An anchor box essentially helps decide which prediction slot is responsible for which kind of object shape. For example (a small label-layout sketch follows this list):
- Two objects are detected in the bottom grid cell
- The person is a better match for anchor box 1, and the car is a better match for anchor box 2
- y then contains one vector per anchor box
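A tiny sketch of what one grid cell's label y could look like with two anchor boxes and, say, three classes (the layout and the class count are illustrative assumptions, not the exact course encoding):

```python
import numpy as np

n_anchors, n_classes = 2, 3
slot = 5 + n_classes                      # (P, bx, by, bh, bw) + one-hot class

y_cell = np.zeros(n_anchors * slot)       # label for a single grid cell

# anchor 1: the person (tall, narrow box), class index 0
y_cell[0:slot] = [1, 0.5, 0.7, 1.4, 0.3,  1, 0, 0]
# anchor 2: the car (short, wide box), class index 1
y_cell[slot:]  = [1, 0.5, 0.6, 0.6, 1.8,  0, 1, 0]

print(y_cell.shape)                       # (16,) per cell; the full label is (grid, grid, 16)
```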
7 Face Recognition
Face recognition can in fact be reduced to another problem, face verification (FV). Face verification compares the similarity of two images, and face recognition can then use this similarity to choose the best match.
Suppose we already have a network such as VGG-16 (DeepFace and FaceNet work this way) that can encode pictures; we take its last FC layer. The activations of that FC layer can be treated as a fingerprint of the face, and by comparing the distance between two such fingerprints we get the similarity of the two pictures.
How is this network trained? With the triplet loss. The general idea is that there is an anchor, a positive, and a negative; make dist(anchor, positive) as small as possible and dist(anchor, negative) as large as possible.
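A minimal NumPy sketch of the triplet loss on precomputed embeddings (the margin `alpha` and the squared-distance form are the usual formulation; the embeddings here are random placeholders for the network's output):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """max(||a - p||^2 - ||a - n||^2 + alpha, 0) for one triplet of embeddings."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(d_pos - d_neg + alpha, 0.0)

emb = lambda: np.random.randn(128)        # stand-in for a 128-d face fingerprint
print(triplet_loss(emb(), emb(), emb()))
```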
8 Style Transfer (Neural Style Transfer)
The difficulty of style transfer is defining the cost function. Let the content image be C, the style image S, and the generated image G; then J(C, S, G) = J_content(C, G) + J_style(S, G) (in practice each term is weighted by a hyperparameter).
J_content(C, G) measures the deviation between the content image and the generated image. Given VGG-16, we only need to extract the activations of C and G at a relatively deep layer (e.g. conv4-2) and compute their distance.
J_style(S, G) measures the deviation between the style image and the generated image. Style can be defined as the correlation of activation values across channels (the Gram matrix). Specifically, the style difference at each chosen layer is computed and then summed.
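A rough NumPy sketch of both cost terms using one layer's activations (shapes and normalization constants are illustrative; the real style cost is a weighted sum over several layers):

```python
import numpy as np

def content_cost(a_C, a_G):
    """Squared distance between content and generated activations at one layer."""
    return 0.5 * np.sum((a_C - a_G) ** 2)

def gram(a):
    """Channel-by-channel correlation matrix of an (H, W, C) activation volume."""
    H, W, C = a.shape
    flat = a.reshape(H * W, C)            # rows = spatial positions, cols = channels
    return flat.T @ flat                  # (C, C) Gram matrix

def style_cost_layer(a_S, a_G):
    H, W, C = a_S.shape
    return np.sum((gram(a_S) - gram(a_G)) ** 2) / (2 * H * W * C) ** 2

a_C, a_S, a_G = (np.random.randn(14, 14, 256) for _ in range(3))
J = content_cost(a_C, a_G) + style_cost_layer(a_S, a_G)   # J(C, S, G) for one layer
print(J)
```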