Object Detection Summary

I have been reading a lot about LLM-related topics recently and drifting away from CV. I happened to read this post from v_JULY_v, which is a good review of object detection techniques and helps me prepare to review Stable Diffusion and video generation.

Here is an overview of object detection (OD).

1. R-CNN

There are two approaches to OD. The first is regression based: use regression to predict the 4 box coordinates (x, y, w, h) directly. But this is too hard to train.

To improve on this, we slide windows over the whole image and classify each one. We also replace the FC layers with convolution layers to speed this up.
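
To make the sliding-window idea concrete, here is a minimal sketch of enumerating window positions. The function name `sliding_windows` and the toy image, window, and stride sizes are my own assumptions for illustration, not values from the original post.

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Yield (x, y, w, h) for every window position that fits inside the image."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)


# Toy example (assumed sizes): an 8x6 image, 4x4 windows, stride 2.
windows = list(sliding_windows(8, 6, 4, 4, 2))
print(len(windows), windows[0], windows[-1])
```

Even on this tiny image there are several windows to classify, and the count grows quickly once you add multiple window sizes. That cost is what motivates region proposals.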

Instead of exhaustive sliding windows, R-CNN uses region proposals generated by methods such as Selective Search or EdgeBoxes. It first gets ~2000 Regions of Interest (RoIs), warps each of them to the same size (227x227), and then sends them to a CNN for feature extraction.

Here are the training steps:

- Fine-tune an AlexNet, with the last layer resized to the number of classes.
- For each RoI, run the CNN and save the features to disk.
- Train SVMs for binary classification.
- Train a regressor for region adjustment (bounding-box regression).
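
The last step, region adjustment, can be sketched as follows. This is a minimal sketch assuming the center-size parameterization from the R-CNN paper (offsets scaled by the proposal size, log-scaled width and height). The function names `encode_box` and `decode_box` and the sample boxes are hypothetical.

```python
import math


def encode_box(proposal, target):
    """Regression targets that move a proposal onto a ground-truth box.

    Both boxes are (center_x, center_y, width, height).
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = target
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))


def decode_box(proposal, deltas):
    """Apply predicted deltas to a proposal to get the adjusted box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = deltas
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))


# Assumed sample boxes, only for illustration.
proposal = (50.0, 50.0, 20.0, 40.0)
ground_truth = (55.0, 45.0, 30.0, 20.0)
deltas = encode_box(proposal, ground_truth)
adjusted = decode_box(proposal, deltas)
print(deltas, adjusted)
```

The regressor is trained to predict `deltas` from the RoI features. At test time `decode_box` turns its prediction back into a box.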

2. SPP Net

Kaiming He et al. published the Spatial Pyramid Pooling (SPP) paper in 2015.

In R-CNN, region proposals need to be warped to 227x227 because the FC layers need a fixed-size input (so the conv layers before them need a fixed-size input too; that is the purpose of warping). But the conv layers themselves do NOT have this requirement. So how about adding a special layer that feeds a fixed-size input to the FC layers, so we don't need to warp the image? That is the difference between R-CNN and SPP.

The key idea is that if you make the pooling window and stride proportional to the input size, you always get a fixed-size output.
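
Here is a minimal pure-Python sketch of that idea for a single channel. I compute the bin boundaries directly from the input size (floor for the start, ceiling for the end) rather than using an explicit window and stride, which is an assumption on my side. The pyramid levels (1, 2, 4) and the names `adaptive_max_pool` and `spp` are also just illustrative.

```python
def adaptive_max_pool(fmap, n):
    """Max-pool a 2D feature map (list of rows) into an n x n grid.

    Bin boundaries scale with the input size, so the output is always n x n.
    """
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(n):
        r0 = (i * h) // n                # floor of the bin start
        r1 = ((i + 1) * h + n - 1) // n  # ceiling of the bin end
        row = []
        for j in range(n):
            c0 = (j * w) // n
            c1 = ((j + 1) * w + n - 1) // n
            row.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        out.append(row)
    return out


def spp(fmap, levels=(1, 2, 4)):
    """Concatenate the pooled grids of every pyramid level into one flat vector."""
    feats = []
    for n in levels:
        for row in adaptive_max_pool(fmap, n):
            feats.extend(row)
    return feats


# Two feature maps of different sizes give vectors of the same length.
small = [[r * 5 + c for c in range(5)] for r in range(5)]
large = [[r * 9 + c for c in range(9)] for r in range(13)]
print(len(spp(small)), len(spp(large)))
```

With levels (1, 2, 4) each channel always yields 1 + 4 + 16 = 21 values, whatever the input size, so the FC layers see a fixed-length vector.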

Another improvement is to compute the conv features ONLY ONCE for the whole image and extract the corresponding patch of the feature map for each RoI.

3. Fast R-CNN

Fast R-CNN applies the SPP idea to R-CNN.

- Add an RoI pooling layer, which is a simplified version of SPP
- Add bounding box regression into the CNN training to get a multi-task model. Also, run the conv layers once for the whole picture instead of on each region
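
RoI pooling can be sketched as a single-level version of the same pooling trick, applied to the part of the shared feature map that an RoI covers. This is a rough single-channel sketch. The function name `roi_pool`, the 1/16 spatial scale, the 2x2 output size, and the sample RoIs are my assumptions for illustration, and the sketch assumes the RoI lies inside the feature map.

```python
def roi_pool(fmap, roi, out_size=2, spatial_scale=1.0):
    """Max-pool the feature-map patch under one RoI into an out_size x out_size grid.

    fmap is a 2D list (one channel); roi is (x0, y0, x1, y1) in image
    coordinates and is projected onto the feature map with spatial_scale.
    """
    x0, y0, x1, y1 = [int(round(v * spatial_scale)) for v in roi]
    roi_w = max(x1 - x0 + 1, 1)  # inclusive coordinates, at least one cell
    roi_h = max(y1 - y0 + 1, 1)
    out = []
    for i in range(out_size):
        r0 = y0 + (i * roi_h) // out_size
        r1 = y0 + ((i + 1) * roi_h + out_size - 1) // out_size
        row = []
        for j in range(out_size):
            c0 = x0 + (j * roi_w) // out_size
            c1 = x0 + ((j + 1) * roi_w + out_size - 1) // out_size
            row.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        out.append(row)
    return out


# One 6x6 feature map computed once, shared by two RoIs of different sizes.
fmap = [[r * 6 + c for c in range(6)] for r in range(6)]
pooled_a = roi_pool(fmap, (16, 16, 79, 63), out_size=2, spatial_scale=1 / 16)
pooled_b = roi_pool(fmap, (0, 0, 47, 31), out_size=2, spatial_scale=1 / 16)
print(pooled_a, pooled_b)
```

Both RoIs come out as 2x2 grids even though they cover different areas, and the conv features are only computed once.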

4. Faster R-CNN

- Use a Region Proposal Network (RPN) to replace Selective Search
- Use anchor boxes

Notice there are 4 loss functions:

- RPN classification (anchor good/bad)
- RPN regression (anchor -> proposal)
- Fast R-CNN classification (over classes)
- Fast R-CNN regression (proposal -> box)
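
To make the anchor boxes concrete, here is a minimal sketch that places a fixed set of anchors at every feature-map location. The stride of 16, the three scales, and the three aspect ratios are the defaults I remember from the Faster R-CNN paper, so treat them as assumptions. The function name `make_anchors` is hypothetical.

```python
import math


def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (center_x, center_y, w, h) anchors for every feature-map cell.

    Each cell gets len(scales) * len(ratios) anchors; ratio means w / h and
    each anchor keeps an area of scale * scale.
    """
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s * math.sqrt(r)
                    h = s / math.sqrt(r)
                    anchors.append((cx, cy, w, h))
    return anchors


# Toy 2x3 feature map: 2 * 3 cells * 9 anchors per cell.
anchors = make_anchors(2, 3)
print(len(anchors), anchors[0])
```

The RPN then scores each anchor (good/bad) and regresses it into a proposal, which is where the first two losses come from.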

Here is a summary of these 4 methods before moving on to the DL-based regression approaches.

| R-CNN | SPP | Fast R-CNN | Faster R-CNN |
|---|---|---|---|
| Selective Search | | Selective Search | RPN |
| | RoI Pooling | RoI Pooling | RoI Pooling |
| CNN (feature extraction) + SVM (classification) | | CNN | CNN |

5. YOLO

- Divide the image into an SxS grid (S=7)
- For each grid cell, predict B bounding boxes (B=2), each with 5 values (x, y, w, h, confidence), plus C class probabilities
- Use NMS (Non-Maximum Suppression) to get rid of extra windows

Without region proposals, the accuracy suffers.
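
The grid arithmetic and the NMS step can be sketched like this. C=20 (the PASCAL VOC class count) and the 0.5 IoU threshold are my assumptions, and `iou` and `nms` are plain-Python illustrations rather than any library's API.

```python
# Output size for the grid above; C = 20 is an assumed class count.
S, B, C = 7, 2, 20
num_boxes = S * S * B              # boxes predicted per image
output_size = S * S * (B * 5 + C)  # values in the output tensor


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep a box only if it does not overlap a higher-scoring kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box (made-up values).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
keep = nms(boxes, scores)
print(num_boxes, output_size, keep)
```

The second box overlaps the first too much and is dropped, so only the best box per object survives.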

6. SSD

SSD adds back anchor boxes.

- Go through certain conv layers to get an m x n feature map with p channels
- For each location, get k bounding boxes with different aspect ratios (these are the anchors)
- For each box, compute c class scores and 4 offsets for (x, y, w, h)

In total this gives (c+4)mnk outputs. Feature maps from different layers also go through 3x3 convs for OD, so SSD has 8732 bounding boxes, which is way more than the 98 from YOLO. (details are here)
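
The 8732 figure can be checked with a few lines of arithmetic. The per-layer feature map sizes and anchors per location below are the SSD300 configuration as I remember it from the SSD paper, and c=21 (20 VOC classes plus background) is also my assumption; neither comes from the original post.

```python
# (feature map side m = n, anchors per location k) for each detection layer.
# Assumed SSD300 configuration.
layers = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

boxes_per_layer = [m * m * k for m, k in layers]
total_boxes = sum(boxes_per_layer)

c = 21  # assumed number of class scores per box
total_outputs = sum((c + 4) * m * m * k for m, k in layers)

print(boxes_per_layer, total_boxes, total_outputs)
```

Most of the boxes come from the largest (38x38) feature map, which is also the layer that handles the smallest objects.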
