Day 88(DL) — YOLOv4: Optimal Speed and Accuracy of Object Detection — Part 2

Bag of Specials: Unlike Bag of Freebies, the modules that fall under the Bag of Specials category incur a small inference cost, but they boost the accuracy of object detection. The principal functions of these plugins include enlarging the receptive field and strengthening the feature-integration capability.

  • The widely used modules for improving the receptive field are SPP, ASPP and RFB. The SPP (Spatial Pyramid Pooling) module takes its idea from Spatial Pyramid Matching (SPM). SPM splits the feature map into several d x d equal blocks, where d can be {1, 2, 3, …}, thus forming a spatial pyramid, and then extracts bag-of-words features.
  • SPP integrates SPM into a CNN, but uses max-pooling instead of the bag-of-words operation. Since the original SPP outputs a one-dimensional feature vector, it cannot be applied in a Fully Convolutional Network (FCN). To make it suitable for an FCN, the SPP module is enhanced to concatenate max-pooling outputs with kernel size k x k, where k = {1, 5, 9, 13} and stride 1. This large k x k max-pooling effectively increases the receptive field of the backbone feature (see the sketch after this list).
  • Another area of improvement is the choice of activation function. To tackle the vanishing-gradient problem, ReLU and later activations such as LReLU, PReLU, ReLU6, Scaled Exponential Linear Unit (SELU), Swish, hard-Swish and Mish have been proposed; Mish is sketched after this list.
  • The final post-processing step in object detection is NMS (Non-Maximum Suppression), which filters out the BBoxes that poorly predict the same object and retains only the candidate BBoxes with the highest response (a greedy version is sketched below).
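
The sketch below illustrates the improved SPP idea from the second bullet, assuming a PyTorch setting: parallel max-poolings with kernel sizes {1, 5, 9, 13}, stride 1 and "same" padding keep the spatial resolution, and their outputs are concatenated along the channel axis. The module name and tensor shapes are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    """YOLO-style SPP: parallel stride-1 max-poolings whose outputs are
    concatenated along the channel axis, enlarging the receptive field
    without changing the feature map's spatial size."""
    def __init__(self, kernel_sizes=(1, 5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList([
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in kernel_sizes
        ])

    def forward(self, x):
        return torch.cat([pool(x) for pool in self.pools], dim=1)

# Example: a 512-channel 13x13 map stays 13x13 but grows to 4 * 512 channels.
features = torch.randn(1, 512, 13, 13)
print(SPPBlock()(features).shape)  # torch.Size([1, 2048, 13, 13])
```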
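Among the activations listed above, Mish is the one YOLOv4 adopts in its backbone; it is defined as Mish(x) = x * tanh(softplus(x)). A minimal version:

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    # Mish(x) = x * tanh(softplus(x)): smooth and non-monotonic, so small
    # negative inputs still pass a (small) gradient, unlike plain ReLU.
    return x * torch.tanh(F.softplus(x))

print(mish(torch.tensor([-2.0, 0.0, 2.0])))
```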
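And the greedy NMS step from the last bullet can be sketched as follows (pure NumPy, box format [x1, y1, x2, y2]; the threshold value is only illustrative):

```python
import numpy as np

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop every remaining box whose IoU
    with it exceeds the threshold, and repeat until no boxes are left."""
    order = scores.argsort()[::-1]          # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        # IoU between the kept box and all remaining candidates.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter)
        order = rest[iou <= iou_threshold]   # survivors for the next round
    return keep
```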

Having discussed the common building blocks used in object detection, it’s time to look at the actual design choices YOLOv4 makes to achieve its balance of speed and accuracy.

Selection of architecture: The aim is to find the optimal balance among the input network resolution, the number of convolutional layers, the number of parameters (filter_size² * filters * channels / groups), and the number of layer outputs (filters); a worked example of the parameter formula follows the bullet below. Experiments indicate that CSPResNext50 performs better than CSPDarknet53 on object classification (ImageNet dataset), while CSPDarknet53 outperforms CSPResNext50 on object detection (MS COCO dataset).

  • A reference model that gives superior results on a classification task is not necessarily the best choice for object detection. A detector needs (1) a higher input network size (resolution), for detecting multiple small-sized objects; (2) more layers, for a receptive field large enough to cover the increased input size; and (3) more parameters, for greater capacity to detect multiple objects of different sizes in a single image.
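
As a quick check of the parameter formula mentioned above, the snippet below computes filter_size² * filters * channels / groups for a hypothetical 3 x 3 convolution; the concrete numbers are only an illustration.

```python
def conv_params(filter_size, filters, channels, groups=1):
    # Weight count of one convolutional layer (biases ignored):
    # filter_size^2 * filters * channels / groups
    return filter_size ** 2 * filters * channels // groups

print(conv_params(3, 512, 256))       # 1,179,648 weights for a plain 3x3 conv
print(conv_params(3, 512, 256, 32))   # 36,864 weights if split into 32 groups
```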

Selection of BoF and BoS: For improving object detection training, the following freebies and specials were considered:

  • Activations: ReLU, leaky-ReLU, parametric-ReLU, ReLU6, SELU, Swish or Mish
  • Bounding box regression loss: MSE, IoU, GIoU, CIoU or DIoU (a GIoU-loss sketch follows this list)
  • Data Augmentation: CutOut, MixUp and CutMix
  • Regularization method: DropOut, DropPath, Spatial DropOut or DropBlock
  • Normalization of the network activations by their mean & variance: Batch Normalization(BN), Cross-GPU Batch Normalization(CGBN or SyncBN), Filter Response Normalization(FRN) or Cross-Iteration Batch Normalization(CBN)
  • Skip-connections: Residual connections, Weighted residual connections, Multi-input weighted residual connections or Cross stage partial connections(CSP)
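
Of the regression losses listed above, the IoU-based variants optimise box overlap directly rather than coordinate errors. Below is a minimal GIoU-loss sketch for axis-aligned [x1, y1, x2, y2] boxes; the tensor layout and epsilon are assumptions for illustration, not details from the paper.

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """GIoU loss: 1 - IoU + (area of C not covered by A or B) / area of C,
    where C is the smallest box enclosing prediction A and target B."""
    # Intersection of A and B.
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)

    # Smallest enclosing box C penalises non-overlapping predictions too.
    cx1 = torch.min(pred[..., 0], target[..., 0])
    cy1 = torch.min(pred[..., 1], target[..., 1])
    cx2 = torch.max(pred[..., 2], target[..., 2])
    cy2 = torch.max(pred[..., 3], target[..., 3])
    c_area = (cx2 - cx1) * (cy2 - cy1)

    return 1.0 - (iou - (c_area - union) / (c_area + eps))
```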

Since PReLU and SELU are more difficult to train and ReLU6 is specifically designed for quantization networks, these activation functions are ruled out. DropBlock is chosen as the regularization strategy because it outperforms the other techniques. For normalization, SyncBN is not considered, as the objective is to train on a single GPU.
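
For intuition, DropBlock removes contiguous square regions of a feature map rather than independent activations, which forces the network to rely on more spatially distributed evidence. A rough sketch, assuming a 4-D NCHW float tensor and an odd block size (the gamma formula follows Ghiasi et al., 2018):

```python
import torch
import torch.nn.functional as F

def dropblock(x, drop_prob=0.1, block_size=7):
    """Zero out block_size x block_size regions of each feature map during
    training, then rescale so the expected activation magnitude is kept."""
    if drop_prob == 0.0:
        return x
    _, _, h, w = x.shape
    # Rate at which block centres are sampled so that roughly drop_prob
    # of all activations end up inside a dropped block.
    gamma = (drop_prob / block_size ** 2) * (h * w) / \
            ((h - block_size + 1) * (w - block_size + 1))
    centres = (torch.rand_like(x) < gamma).float()
    # Grow each sampled centre into a full block with a stride-1 max-pool.
    block_mask = F.max_pool2d(centres, kernel_size=block_size,
                              stride=1, padding=block_size // 2)
    keep_mask = 1.0 - block_mask
    return x * keep_mask * keep_mask.numel() / (keep_mask.sum() + 1e-7)
```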

Additional Improvements: To make the detector more suitable for training on a single GPU, the following additions were made:

  • New data augmentation techniques, Mosaic and Self-Adversarial Training (SAT), are introduced.
  • Some existing methods are modified to make the design suitable for efficient training and detection: modified SAM, modified PAN, and Cross mini-Batch Normalization (CmBN).
  • Mosaic is a new data augmentation method that mixes 4 training images, whereas CutMix mixes only 2. This allows objects to be detected outside their usual context (sketched below).
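
A rough idea of what Mosaic does is sketched below; bounding-box remapping and the usual random crops are omitted, and the canvas size and nearest-neighbour resize are simplifications for illustration.

```python
import numpy as np

def mosaic(images, out_size=608):
    """Tile 4 training images into the quadrants of one out_size x out_size
    canvas around a randomly chosen centre point."""
    assert len(images) == 4  # each image is an HxWx3 array
    canvas = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
    cx = np.random.randint(out_size // 4, 3 * out_size // 4)
    cy = np.random.randint(out_size // 4, 3 * out_size // 4)
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        h, w = y2 - y1, x2 - x1
        # Nearest-neighbour resize via index sampling keeps the sketch
        # dependency-free; a real pipeline would use OpenCV or PIL.
        ys = np.linspace(0, img.shape[0] - 1, h).astype(int)
        xs = np.linspace(0, img.shape[1] - 1, w).astype(int)
        canvas[y1:y2, x1:x2] = img[ys][:, xs]
    return canvas
```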

Recommended Reading:

https://towardsdatascience.com/data-augmentation-in-yolov4-c16bd22b2617#:~:text=Mosaic%20data%20augmentation%20%E2%80%94%20Mosaic%20data,a%20smaller%20scale%20than%20normal.
