Day 84 (DL) — YOLOv2 / YOLO9000 (Better, Faster & Stronger) — Part 2

This article continues the previous post, which discussed the enhancements made to YOLOv1 to produce version 2. In this part we wrap up with the remaining improvements.

Direct location prediction: The second issue with incorporating anchor boxes into YOLO is model instability during the early iterations. The instability comes from predicting the (x, y) location of the box: with no bound imposed on the offsets, a box can drift so that its centre lands on a different feature point than the one it originated from. The model then takes a long time to converge to the desired bounds.

  • The better approach is to predict the location coordinates relative to the grid cell. The network predicts 5 bounding boxes at each cell of the feature map, with 5 coordinates per box: tx, ty, tw, th, and to. Below is the formula for the predictions, including the location constraints.
Fig 1 — restricting the bounding-box coordinates to the cell coordinates

Since ‘cx’ and ‘cy’ are the top-left corner of the grid cell from which the bounding box originates, adding them to the sigmoid-constrained predicted offsets to obtain ‘bx’ and ‘by’ ensures the centre of the predicted box stays within the limits of that feature point (in other words, the grid cell).
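The constraint can be sketched in Python. This is a minimal illustration of the YOLOv2 decoding equations; the helper name and plain-Python sigmoid are illustrative, not from the paper:

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw network outputs into bounding-box coordinates.

    (cx, cy) is the top-left corner of the grid cell; (pw, ph) is the
    anchor (prior) width and height. The sigmoid squashes tx and ty
    into [0, 1], so the box centre cannot drift outside its cell.
    """
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = sigmoid(tx) + cx      # centre x, confined to the cell
    by = sigmoid(ty) + cy      # centre y, confined to the cell
    bw = pw * math.exp(tw)     # width scales the anchor prior
    bh = ph * math.exp(th)     # height scales the anchor prior
    return bx, by, bw, bh

# With zero offsets the box sits at the cell centre with the anchor's size.
print(decode_box(0.0, 0.0, 0.0, 0.0, 3, 2, 1.5, 2.0))  # (3.5, 2.5, 1.5, 2.0)
```

The objectness score uses the same sigmoid on `to`, so every constrained quantity lies in a bounded range, which is what stabilises the early iterations.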

Fine-Grained Features: The 13 x 13 feature map in YOLOv2 does a great job of predicting larger objects, but for localising smaller ones the model leverages fine-grained features. The motivation comes from skip connections, similar to the identity mappings in ResNet.

  • A new pass-through layer is introduced which extracts features from an earlier layer at a resolution of 26 x 26. It stacks adjacent higher-resolution features into extra channels, reorganising the 26 x 26 x 512 map into 13 x 13 x 2048 so it can be concatenated with the original low-resolution feature map.
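The reorganisation above is a space-to-depth operation. A minimal NumPy sketch (the function name is illustrative; frameworks often call this "reorg" or "space-to-depth"):

```python
import numpy as np

def passthrough(features, stride=2):
    """Stack each stride x stride spatial block into the channel axis.

    A (26, 26, 512) map becomes (13, 13, 2048), which can then be
    concatenated channel-wise with the low-resolution 13 x 13 map.
    """
    h, w, c = features.shape
    out = features.reshape(h // stride, stride, w // stride, stride, c)
    out = out.transpose(0, 2, 1, 3, 4)          # group spatial blocks together
    return out.reshape(h // stride, w // stride, c * stride * stride)

x = np.zeros((26, 26, 512), dtype=np.float32)
print(passthrough(x).shape)  # (13, 13, 2048)
```

No information is lost: the higher-resolution detail is simply moved into channels, where the detection head can use it for small objects.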

Multi-Scale Training: Instead of keeping the input size fixed, the model is trained over a range of input shapes. The input size of the original YOLO is 448 x 448, and with the addition of anchor boxes the resolution is reduced to 416 x 416.

  • To make the model more robust, every 10 batches the network randomly selects new input dimensions. Since the downsampling factor is 32, the sizes are drawn from multiples of 32: {320, 352, …, 608}. The smallest option is 320 x 320, and the largest is 608 x 608.
  • This lets the network predict well across a wide range of resolutions, trading off speed against accuracy. For instance, at 288 x 288 the model runs at more than 90 FPS with mAP similar to Fast R-CNN. Such configurations can be deployed on smaller GPUs for high-frame-rate video use cases.
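The size-selection schedule above can be sketched as follows (the helper name is illustrative; the scale set and 10-batch interval are from the paper):

```python
import random

# Candidate input resolutions: multiples of 32 from 320 to 608,
# matching the network's downsampling factor of 32.
SCALES = list(range(320, 609, 32))   # [320, 352, ..., 608]

def pick_input_size(batch_idx, current=416):
    """Every 10 batches, randomly pick a new square input resolution."""
    if batch_idx % 10 == 0:
        return random.choice(SCALES)
    return current
```

Because the network is fully convolutional, the same weights handle every resolution in the set; only the input is resized.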
Fig 2 — accuracy (mAP) vs. speed (FPS)

The YOLOv2 models compared here share the same weights and are trained on the same set; the only variation is the evaluation resolution. The higher the input width and height (higher resolution), the higher the accuracy (mAP).

Faster: With the introduction of the Darknet-19 architecture, the speed of the network is greatly improved. Since Darknet-19 uses far fewer parameters and operations than VGG-16 (the usual choice for feature extraction), it is much faster. The architecture is as follows:

Fig 3 — Darknet-19 architecture

YOLOv2 is first trained for classification on the 1000-class ImageNet dataset. The network is then modified for detection by removing the last convolutional layer and adding three 3 x 3 convolutional layers followed by a final 1 x 1 convolutional layer. For the VOC dataset, detection predicts 5 boxes with 5 coordinates each and 20 classes per box, so the final layer has 125 filters. The same training strategy is used on the COCO dataset.
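The 125-filter count follows directly from the per-box predictions; a quick check:

```python
# For VOC: 5 anchor boxes, each predicting 5 values (tx, ty, tw, th, to)
# plus 20 class scores, giving the filters of the final 1 x 1 conv layer.
num_anchors = 5
num_coords = 5        # tx, ty, tw, th, to
num_classes = 20      # Pascal VOC classes
num_filters = num_anchors * (num_coords + num_classes)
print(num_filters)    # 125
```

For COCO (80 classes), the same formula gives 5 × (5 + 80) = 425 filters; only this final layer changes between datasets.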

