Day 87(DL) — YOLOv4: Optimal Speed and Accuracy of Object Detection — Part 1

Note: This series on YOLOv4 walks through the original paper, and most of the content is drawn directly from it.

We’ve already seen several upgrades to ‘You Only Look Once’ (YOLO) over time. The next set of enhancements took the model to a new level in terms of speed while still running on conventional GPUs for real-time object detection. Version 4 was introduced by a different set of authors and, for the greater good, it was made open source (with a free license).

Fig 1 — shows the comparison between YOLOv4 and other SOTA models from original paper

YOLOv4 runs twice as fast as EfficientDet with comparable performance. It also improves on YOLOv3’s AP and FPS by 10% and 12% respectively.

Outcomes of the evolution:

  • A superfast and accurate object detector that can be trained on a single 1080 Ti or 2080 Ti GPU.
  • Bag-of-Freebies and Bag-of-Specials object detection methods are leveraged during training.
  • Some state-of-the-art methods are modified to make them more efficient and suitable for training on a single GPU. The changes cover CBN (Cross-iteration Batch Normalization), PAN (Path Aggregation Network) and SAM (Spatial Attention Module).

Before delving into the actual improvements, let’s go over some details of the related work.

Object detection models: Models designed for object detection usually comprise two components: (1) a backbone, pre-trained on ImageNet, for feature extraction, and (2) a head that predicts the classes and bounding boxes of objects. For detectors running on a GPU, the backbone can be any one of the CNNs such as VGG, ResNet, ResNeXt or DenseNet. For detectors running on a CPU, the backbone can be SqueezeNet, MobileNet or ShuffleNet.

  • The head comes in two variants: one-stage and two-stage object detectors. Popular two-stage detectors include the region proposal series R-CNN, Fast R-CNN and Faster R-CNN. Examples of one-stage detectors are YOLO, SSD and RetinaNet.
  • Moreover, some recent object detectors insert additional layers between the backbone and the head. These layers are usually used to collect feature maps from different stages and are referred to as the neck of the object detector. The neck consists of several bottom-up paths as well as top-down paths. Networks with this kind of mechanism include Feature Pyramid Network (FPN), Path Aggregation Network (PAN), BiFPN and NAS-FPN.
Fig2 — shows typical one-stage Vs two-stage detectors from original paper
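
To make the backbone / neck / head split concrete, here is a minimal structural sketch in Python. Everything in it is hypothetical: `Detector`, `toy_backbone`, `toy_neck` and `toy_head` are illustrative names, and the "feature maps" are plain lists of numbers rather than real tensors. It only shows how the three parts are chained together, not how YOLOv4 implements any of them.

```python
from dataclasses import dataclass
from typing import Callable, List

FeatureMap = List[float]  # stand-in for a real tensor (assumption for illustration)


@dataclass
class Detector:
    backbone: Callable[[List[float]], List[FeatureMap]]   # image -> multi-scale features
    neck: Callable[[List[FeatureMap]], List[FeatureMap]]  # fuse features across scales
    head: Callable[[List[FeatureMap]], List[dict]]        # features -> predictions

    def __call__(self, image: List[float]) -> List[dict]:
        return self.head(self.neck(self.backbone(image)))


def toy_backbone(image):
    # Each "stage" halves the resolution by averaging neighbouring pairs.
    stages, current = [], list(image)
    for _ in range(3):
        current = [(current[i] + current[i + 1]) / 2
                   for i in range(0, len(current) - 1, 2)]
        stages.append(current)
    return stages


def toy_neck(features):
    # Top-down pass (FPN-like in spirit): push a summary of each coarser
    # map into the next finer one, starting from the coarsest scale.
    fused = [list(f) for f in features]
    for i in range(len(fused) - 2, -1, -1):
        coarse_mean = sum(fused[i + 1]) / len(fused[i + 1])
        fused[i] = [v + coarse_mean for v in fused[i]]
    return fused


def toy_head(features):
    # One dummy "prediction" per scale: the scale index and its peak response.
    return [{"scale": i, "score": max(f)} for i, f in enumerate(features)]


detector = Detector(toy_backbone, toy_neck, toy_head)
predictions = detector([float(v) for v in range(16)])
print(predictions)
```

Swapping any one of the three callables leaves the other two untouched, which is the practical reason detectors are described in terms of backbone, neck and head.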

Bag of freebies: Object detectors are usually trained offline. This lets researchers develop techniques that are applied only during the training phase and not at inference time. Such techniques are called a ‘bag of freebies’: they change the training strategy or increase the training cost without affecting the computation at prediction time. One such method is data augmentation.

  • Data augmentation increases the variability of the training samples by applying photometric and geometric distortions to the input images. Photometric distortion adjusts the brightness, contrast, hue, saturation and noise of an image, while geometric distortion applies random scaling, cropping, flipping and rotating.
  • In addition to the above, other data augmentation methods simulate object occlusion or combine multiple images. Some of them are random erase, CutOut, Mixup and CutMix. Random erase and CutOut randomly select a rectangular region in an image and fill it with a random value or with zero. Mixup blends two images pixel-wise using different coefficient ratios and then adjusts the label using the same ratios.
  • CutMix pastes a patch cropped from one image onto a rectangular region of another image and adjusts the label according to the size of the mixed area. On top of these methods, style transfer GAN is also used for augmentation, as it reduces the texture bias learned by CNNs.
  • Another bag of freebies is the objective function of the bounding box (BBox) regressor. Applying mean squared error (MSE) loss directly to the coordinate values treats each coordinate as an independent variable, which does not account for the integrity of the object itself. IoU (intersection over union) loss is a better alternative for backpropagation because it works on the overlap between the predicted and ground truth BBoxes, and it is also invariant to scale.
  • The IoU loss has been extended to GIoU loss, which takes the shape and orientation of an object into account in addition to the coverage area. The idea is to find the smallest BBox that simultaneously covers the predicted BBox and the ground truth BBox, and to use the area of this enclosing BBox as the denominator of an extra penalty term.
  • DIoU loss additionally considers the distance between the centres of the two boxes, and CIoU loss simultaneously considers the overlapping area, the distance between centre points and the aspect ratio. CIoU can achieve better convergence speed and accuracy on BBox regression.
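
The occlusion-style augmentations described in the bullets above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the implementation used by YOLOv4: images are tiny grayscale grids stored as lists of rows, labels are one-hot lists, and the function names `cutout`, `mixup` and `cutmix` are hypothetical helpers. The rectangle and the mixing weight `lam` are passed in explicitly so the result is reproducible; in practice they are drawn at random.

```python
def cutout(image, top, left, height, width, fill=0.0):
    """Return a copy of `image` with one rectangle overwritten by `fill`."""
    out = [row[:] for row in image]
    for r in range(top, min(top + height, len(out))):
        for c in range(left, min(left + width, len(out[r]))):
            out[r][c] = fill
    return out


def mixup(image_a, label_a, image_b, label_b, lam):
    """Blend two images pixel-wise with weight `lam`; blend labels the same way."""
    image = [[lam * a + (1 - lam) * b for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(image_a, image_b)]
    label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return image, label


def cutmix(image_a, label_a, image_b, label_b, top, left, height, width):
    """Paste a rectangle of image_b onto image_a; weight labels by pasted area."""
    out = [row[:] for row in image_a]
    rows, cols = len(out), len(out[0])
    pasted = 0
    for r in range(top, min(top + height, rows)):
        for c in range(left, min(left + width, cols)):
            out[r][c] = image_b[r][c]
            pasted += 1
    lam = 1 - pasted / (rows * cols)  # fraction of image_a that survives
    label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return out, label


# Two 4x4 toy images: class 0 is all ones, class 1 is all twos.
img_a, lab_a = [[1.0] * 4 for _ in range(4)], [1.0, 0.0]
img_b, lab_b = [[2.0] * 4 for _ in range(4)], [0.0, 1.0]

erased = cutout(img_a, top=1, left=1, height=2, width=2)
mixed_img, mixed_lab = mixup(img_a, lab_a, img_b, lab_b, lam=0.25)
cm_img, cm_lab = cutmix(img_a, lab_a, img_b, lab_b, top=0, left=0, height=2, width=2)

print(erased)
print(mixed_lab)  # label weights follow the pixel blend
print(cm_lab)     # label weights follow the pasted area
```

Note the difference in how the label is adjusted: Mixup reuses the blending coefficient, while CutMix derives the weight from the fraction of the image that was replaced.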

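The IoU-based regression objectives from the bullets above can be compared side by side with a short sketch. It follows the commonly cited definitions of GIoU, DIoU and CIoU rather than any code from the YOLOv4 repository. The helper name `iou_family` and the `(x1, y1, x2, y2)` box format are assumptions made for illustration. Each loss is simply one minus the corresponding value.

```python
import math


def _area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou_family(pred, truth):
    """Return (IoU, GIoU, DIoU, CIoU) for two boxes given as (x1, y1, x2, y2)."""
    # Intersection and union areas.
    inter = _area((max(pred[0], truth[0]), max(pred[1], truth[1]),
                   min(pred[2], truth[2]), min(pred[3], truth[3])))
    union = _area(pred) + _area(truth) - inter
    iou = inter / union

    # Smallest box enclosing both boxes (used by GIoU, DIoU and CIoU).
    ex1, ey1 = min(pred[0], truth[0]), min(pred[1], truth[1])
    ex2, ey2 = max(pred[2], truth[2]), max(pred[3], truth[3])
    enclose = _area((ex1, ey1, ex2, ey2))

    # GIoU: penalise the part of the enclosing box not covered by the union.
    giou = iou - (enclose - union) / enclose

    # DIoU: penalise squared centre distance relative to the enclosing diagonal.
    centre_dist2 = (((pred[0] + pred[2]) - (truth[0] + truth[2])) / 2) ** 2 \
                 + (((pred[1] + pred[3]) - (truth[1] + truth[3])) / 2) ** 2
    diag2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    diou = iou - centre_dist2 / diag2

    # CIoU: add an aspect-ratio consistency term v with trade-off weight alpha.
    pw, ph = pred[2] - pred[0], pred[3] - pred[1]
    tw, th = truth[2] - truth[0], truth[3] - truth[1]
    v = (4 / math.pi ** 2) * (math.atan(tw / th) - math.atan(pw / ph)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0
    ciou = diou - alpha * v
    return iou, giou, diou, ciou


# Partially overlapping boxes with the same aspect ratio.
overlap = iou_family((0, 0, 2, 2), (1, 1, 3, 3))
# Disjoint boxes: IoU is 0, but GIoU and DIoU still reflect how far apart they are.
disjoint = iou_family((0, 0, 1, 1), (2, 0, 3, 1))
# The same overlap at 10x scale gives the same IoU (scale invariance).
scaled = iou_family((0, 0, 20, 20), (10, 10, 30, 30))

print(overlap)
print(disjoint)
print(scaled)
```

The disjoint case shows why the extensions matter: plain IoU is 0 for every pair of non-overlapping boxes, whereas GIoU and DIoU keep changing as the boxes move closer or further apart.
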
The rest will be discussed in the upcoming posts.
