Object Detection Summary

I have been reading a lot about LLM-related topics recently and drifting away from CV. I happened to read this post from v_JULY_v; it is a good review of object detection (OD) techniques, and it helps me prepare to review Stable Diffusion and video generation.

Here is an overview of OD: [figure: overview of OD methods]

1. R-CNN

There are two approaches to OD. The first is regression based: use regression to predict the 4 box coordinates (x, y, w, h) directly. But the training is too hard.

To improve on this, we slide windows over the whole image. We also replace the FC layers with conv layers to speed things up.

R-CNN relies on region proposals generated by methods such as Selective Search or EdgeBoxes. Get ~2000 Regions of Interest (RoIs) first, warp each of them to the same size (227x227), and then send them to the CNN for feature extraction. Here are the training steps:

  • Fine-tune an AlexNet, resizing the last layer to the number of classes
  • For each RoI, run the CNN and save the features to disk
  • Train a binary SVM per class on those features
  • Run regression to adjust the region (bounding box)
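
The last step, region adjustment, can be sketched roughly as follows. This is a minimal illustration, assuming boxes are center-based (cx, cy, w, h) and using the offset parameterization described in the R-CNN paper. The names `encode_box` and `decode_box` are made up for this example.

```python
import math

def encode_box(proposal, target):
    """Offsets (tx, ty, tw, th) that move a proposal onto a target box.

    Both boxes are (cx, cy, w, h); this layout is an assumption.
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = target
    return (
        (gx - px) / pw,      # shift in x, relative to proposal width
        (gy - py) / ph,      # shift in y, relative to proposal height
        math.log(gw / pw),   # log scale change in width
        math.log(gh / ph),   # log scale change in height
    )

def decode_box(proposal, deltas):
    """Apply predicted offsets to a proposal to get the adjusted box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = deltas
    return (px + pw * tx, py + ph * ty, pw * math.exp(tw), ph * math.exp(th))

# Hypothetical proposal and ground-truth box.
proposal = (100.0, 100.0, 50.0, 80.0)
ground_truth = (110.0, 96.0, 60.0, 80.0)

deltas = encode_box(proposal, ground_truth)   # what the regressor learns to predict
recovered = decode_box(proposal, deltas)      # what we get back at inference time
```

The regressor is trained to predict `deltas` from the RoI features, so the targets stay small and roughly scale-invariant regardless of how large the proposal is.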

2. SPP Net

Kaiming He et al. published the Spatial Pyramid Pooling (SPP) paper (ECCV 2014; the journal version appeared in TPAMI in 2015).

In R-CNN, region proposals need to be warped to 227x227 because the FC layers need a fixed-size input (so the conv layers before them get a fixed-size input too; that's the purpose of warping). But the conv layers themselves do NOT have this requirement. So how about adding a special layer that feeds a fixed-size input to the FC layers, so we don't need to warp the image? Here is the difference between R-CNN and SPP: [figure: R-CNN vs. SPP Net pipeline]

The key idea is that if you make the pooling window and stride proportional to the size of the input feature map, you always get a fixed-size output.
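
A minimal pure-Python sketch of this idea, assuming a single-channel feature map, max pooling, and a 3-level pyramid (4x4, 2x2, 1x1). The function names and the way bin boundaries are rounded are illustrative choices, not the paper's exact implementation.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def spp(feature_map, levels=(4, 2, 1)):
    """Max-pool a 2D feature map into a fixed-length vector.

    For each pyramid level n, the map is split into n x n bins whose
    size is proportional to the map size, so the output length is
    sum(n * n for n in levels) no matter how large the input is.
    """
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for n in levels:
        for i in range(n):
            r0, r1 = (i * h) // n, ceil_div((i + 1) * h, n)
            for j in range(n):
                c0, c1 = (j * w) // n, ceil_div((j + 1) * w, n)
                out.append(max(feature_map[r][c]
                               for r in range(r0, r1)
                               for c in range(c0, c1)))
    return out

# Two feature maps of different sizes (values are just r * width + c).
small = [[r * 5 + c for c in range(5)] for r in range(6)]    # 6 x 5
large = [[r * 9 + c for c in range(9)] for r in range(13)]   # 13 x 9

vec_small = spp(small)
vec_large = spp(large)
# Both vectors have 16 + 4 + 1 = 21 entries.
```

Because both vectors have the same length, they can go into the same FC layer without warping the input.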

Another improvement is to compute the conv layers only ONCE for the whole image and then extract the corresponding patch of the feature map for each RoI.

3. Fast R-CNN

Apply the SPP idea to R-CNN.

  • Add an RoI pooling layer, which is a simplified version of SPP (a single pyramid level)
  • Add bounding box regression into the CNN training to get a multi-task model. Also, run conv once for the whole picture instead of on each region.
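
RoI pooling on a shared feature map can be sketched like this. It is a simplified pure-Python version, assuming a single-channel feature map, a conv stride of 16 between image and feature map, and a 2x2 output. All three are assumptions for illustration, and the function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def image_box_to_roi(box, stride=16):
    """Project an image-space box (x0, y0, x1, y1) onto the feature map.

    Returns (r0, c0, r1, c1), end-exclusive, in feature-map cells.
    """
    x0, y0, x1, y1 = box
    return (y0 // stride, x0 // stride, ceil_div(y1, stride), ceil_div(x1, stride))

def roi_pool(feature_map, roi, out_size=2):
    """Max-pool one RoI of the feature map into an out_size x out_size grid."""
    r0, c0, r1, c1 = roi
    h, w = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_size):
        rs, re = r0 + (i * h) // out_size, r0 + ceil_div((i + 1) * h, out_size)
        row = []
        for j in range(out_size):
            cs, ce = c0 + (j * w) // out_size, c0 + ceil_div((j + 1) * w, out_size)
            row.append(max(feature_map[r][c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        pooled.append(row)
    return pooled

# The conv features are computed once for the whole image (8 x 8 here).
feature_map = [[r * 8 + c for c in range(8)] for r in range(8)]

# Two proposals of different sizes reuse the same feature map.
roi_a = image_box_to_roi((32, 16, 112, 96))   # covers rows 1..5, cols 2..6
roi_b = image_box_to_roi((0, 0, 128, 128))    # the whole map

pooled_a = roi_pool(feature_map, roi_a)
pooled_b = roi_pool(feature_map, roi_b)
```

Both `pooled_a` and `pooled_b` are 2x2 even though the RoIs differ in size, which is what lets every proposal share one set of FC layers.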

4. Faster R-CNN

  • Use a Region Proposal Network (RPN) to replace selective search
  • Use anchor boxes. Notice there are 4 loss functions:
    • RPN classification (anchor good/bad)
    • RPN regression (anchor -> proposal)
    • Fast R-CNN classification (over classes)
    • Fast R-CNN regression (proposal -> box)
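
To make the anchor idea concrete, here is a small sketch that generates the anchors for one feature-map location. It assumes 3 scales (128, 256, 512) and 3 aspect ratios (1:2, 1:1, 2:1), the values usually quoted for Faster R-CNN but not stated in this post. The function names are made up for the example.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors (x0, y0, x1, y1) centered at (cx, cy).

    Each anchor has area scale**2 and aspect ratio h / w = ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

def count_anchors(feat_w, feat_h, k=9):
    """Total anchors when k anchors are placed at every feature-map location."""
    return feat_w * feat_h * k

anchors = make_anchors(200.0, 200.0)

# Assumed example: a ~1000 x 600 image with stride 16 gives a ~60 x 40 map.
total = count_anchors(60, 40)
```

The RPN classification loss scores each of these anchors as object or background, and the RPN regression loss learns the offsets that turn a good anchor into a proposal.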

Here is a summary of these 4 methods before moving on to the DL-based regression approaches:

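| R-CNN | SPP Net | Fast R-CNN | Faster R-CNN |
|---|---|---|---|
| Selective Search | | Selective Search | RPN |
| | RoI Pooling | RoI Pooling | RoI Pooling |
| CNN (feature extraction) + SVM (classification) | | CNN | CNN |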


5. YOLO

  • Divide the image into an SxS grid (S=7)
  • For each grid cell, predict B bounding boxes (B=2) with 5 values each (x, y, w, h, confidence), plus C class probabilities
  • Use NMS (Non-Maximum Suppression) to get rid of extra windows

Without region proposals, the accuracy suffers.
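
The grid arithmetic and the NMS step can be sketched as follows. This is a minimal greedy NMS in pure Python, assuming corner-format boxes (x0, y0, x1, y1), an IoU threshold of 0.5, and C=20 classes. The threshold, the box format, and C are assumptions rather than values from this post, and the example boxes are invented.

```python
S, B, C = 7, 2, 20                     # C = 20 is an assumed class count
num_boxes = S * S * B                  # boxes predicted per image
output_size = S * S * (B * 5 + C)      # values in the output tensor

def iou(a, b):
    """Intersection over union of two boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

# Two near-duplicate detections and one separate detection.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)   # indices of the boxes that survive
```

Here the second box overlaps the first with IoU of about 0.82, so it is suppressed and only the highest-scoring box of the pair is kept.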

6. SSD

SSD adds anchor boxes back.

  • Go through certain conv layers to get an m x n feature map with p channels
  • For each location, get k bounding boxes with different aspect ratios (these are the anchors)
  • For each box, compute c class scores and 4 offsets for (x, y, w, h)

In total this gives (c+4)kmn outputs per feature map. Feature maps from different layers also go through 3x3 conv for OD, so SSD ends up with 8732 bounding boxes, which is way more than the 98 from YOLO. (details are here)
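
The box counts can be checked with a few lines of arithmetic. This sketch assumes the SSD300 configuration (six square feature maps of side 38, 19, 10, 5, 3, 1 with 4 or 6 boxes per location) and c=21 classes; these numbers are assumptions taken from the commonly cited SSD300 setup, not from this post.

```python
def outputs_per_map(m, n, k, c):
    """Number of outputs for one m x n feature map: (c + 4) * k * m * n."""
    return (c + 4) * k * m * n

# (feature map side, boxes per location k): assumed SSD300 configuration.
SSD300_MAPS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

boxes_per_map = [side * side * k for side, k in SSD300_MAPS]
total_boxes = sum(boxes_per_map)

# YOLO for comparison: S x S grid with B boxes per cell.
yolo_boxes = 7 * 7 * 2

# Outputs of the first (38 x 38) map with an assumed c = 21 classes.
first_map_outputs = outputs_per_map(38, 38, 4, 21)
```

Most of the boxes come from the largest (38x38) feature map, which is where small objects are picked up.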


