Day 85(DL) — YOLOv3: An Incremental Improvement

The next version in the YOLO series is v3, which introduces several design changes to the network. The modifications make the network larger than the previous version, but also more accurate.

Bounding Box Prediction: YOLOv3 follows YOLO9000 for bounding box prediction, i.e. dimension clusters are used as anchor boxes. The network predicts four coordinates for each box: tx, ty, tw and th. The offset of the cell from the top-left corner of the image is (cx, cy), and the width and height of the prior are pw and ph. The predicted box is then computed as below:

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^tw
bh = ph · e^th

The prediction formula is taken from the original paper
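
As a minimal sketch of the box prediction described above: the centre offsets pass through a sigmoid and are added to the cell offset, while the prior width and height are scaled exponentially. The function name `decode_box` and the sample values are hypothetical, chosen only for illustration.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function, squashes x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Turn raw outputs (tx, ty, tw, th) into a box centre and size.

    cx, cy: offset of the cell from the top-left corner (grid units).
    pw, ph: width and height of the bounding box prior.
    """
    bx = sigmoid(tx) + cx      # centre x stays inside the cell
    by = sigmoid(ty) + cy      # centre y stays inside the cell
    bw = pw * math.exp(tw)     # prior width, scaled
    bh = ph * math.exp(th)     # prior height, scaled
    return bx, by, bw, bh

# Hypothetical example: all-zero outputs in cell (3, 4) with a 116 x 90 prior.
print(decode_box(0.0, 0.0, 0.0, 0.0, cx=3, cy=4, pw=116, ph=90))
# (3.5, 4.5, 116.0, 90.0)
```

With all-zero outputs the box sits in the middle of its cell and keeps the prior's size, which is why the priors act as a starting guess.
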
  • A sum of squared error loss measures the deviation of the predicted coordinates from the ground truth. YOLOv3 uses logistic regression to predict an objectness score for each bounding box. The target score is 1 for a bounding box prior if it overlaps a ground truth object more than any other prior does.
  • Each ground truth object is assigned only one prior, the one with the highest IOU. A prior that is not the best but still overlaps a ground truth object by more than the threshold of 0.5 is ignored. A prior that is not assigned to any ground truth object contributes only to the objectness loss, not to the box coordinate or class prediction losses.
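
The assignment rule can be sketched as follows. This is a simplified illustration for a single ground truth box, not the training code of any particular implementation; the helper names `iou` and `assign_priors` and the corner-format boxes are assumptions made for the example.

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2) with positive area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def assign_priors(priors, truth, ignore_thresh=0.5):
    """Label every prior for one ground truth box."""
    overlaps = [iou(p, truth) for p in priors]
    best = max(range(len(priors)), key=lambda i: overlaps[i])
    labels = []
    for i, overlap in enumerate(overlaps):
        if i == best:
            labels.append("positive")  # objectness target 1, box and class loss apply
        elif overlap > ignore_thresh:
            labels.append("ignore")    # overlaps well but is not the best: no loss
        else:
            labels.append("negative")  # only the objectness loss applies
    return labels

# Hypothetical boxes, for illustration only.
truth = (0, 0, 10, 10)
priors = [(0, 0, 10, 10), (0, 0, 10, 8), (20, 20, 30, 30)]
print(assign_priors(priors, truth))  # ['positive', 'ignore', 'negative']
```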

Class Prediction: The classes contained in each bounding box are predicted with a multilabel classification approach. Independent logistic classifiers are used for class prediction, trained with a binary cross-entropy loss. Unlike softmax, which assumes each box has exactly one class, this lets a single box carry several overlapping labels (e.g. Woman and Person).
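
The difference between the two approaches can be seen in a small sketch. The logits and the three class names are hypothetical; the point is only that independent sigmoids can give two classes a high score at once, while softmax forces the classes to compete.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over the classes of one box."""
    return -sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for p, t in zip(probs, targets)
    ) / len(probs)

logits = [4.0, 3.5, -5.0]   # hypothetical scores for person, woman, car
targets = [1, 1, 0]         # the box is both a person and a woman

independent = [sigmoid(z) for z in logits]   # one logistic classifier per class
competing = softmax(logits)                  # probabilities forced to sum to 1

print([round(p, 3) for p in independent])
print([round(p, 3) for p in competing])
print(binary_cross_entropy(independent, targets))
```

In this example both person and woman score above 0.9 with the independent classifiers, whereas softmax has to split the probability mass between them.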

Predictions Across Scales: YOLOv3 predicts boxes at three different scales, extracting features at each scale in a way similar to feature pyramid networks. Several convolutional layers are added on top of the base feature extractor, and the last of these outputs a 3-d tensor encoding the bounding box, objectness and class predictions. For the COCO dataset, 3 boxes are predicted at each location of every scale, so the tensor size is N × N × [3 * (4 + 1 + 80)], corresponding to 4 box coordinates, 1 objectness score and 80 class probabilities.
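
The tensor size is simple arithmetic, sketched below. The function names and the grid size N = 13 are assumptions used only for illustration.

```python
def output_depth(boxes_per_cell: int, num_classes: int) -> int:
    """Channels per grid cell: each box has 4 coordinates, 1 objectness, C classes."""
    return boxes_per_cell * (4 + 1 + num_classes)

def output_shape(n: int, boxes_per_cell: int = 3, num_classes: int = 80):
    """Shape of the prediction tensor for an N x N grid."""
    return (n, n, output_depth(boxes_per_cell, num_classes))

print(output_depth(3, 80))  # 255
print(output_shape(13))     # (13, 13, 255), N = 13 is a hypothetical grid size
```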

  • To predict at the finer scales, the feature map from a deeper layer is upsampled by 2× and concatenated with a feature map from an earlier layer, combining the semantic information with fine-grained feature details.
  • In total 9 anchor boxes are used, chosen with k-means clustering and divided evenly across the three scales (3 per scale) after sorting them by dimension. On the COCO dataset the 9 clusters were: (10 × 13), (16 × 30), (33 × 23), (30 × 61), (62 × 45), (59 × 119), (116 × 90), (156 × 198), (373 × 326).
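
The grouping of anchors per scale can be sketched as below. The 416 × 416 input size, the strides of 8, 16 and 32, and the pairing of the smallest anchors with the finest grid are assumptions based on common YOLOv3 setups, not something stated above.

```python
# The 9 COCO anchor clusters (width, height) listed above.
ANCHORS = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]

def group_anchors(anchors, num_scales=3):
    """Sort anchors by area and split them evenly across the scales."""
    ordered = sorted(anchors, key=lambda wh: wh[0] * wh[1])
    per_scale = len(ordered) // num_scales
    return [ordered[i * per_scale:(i + 1) * per_scale] for i in range(num_scales)]

# Assumptions (not from the text above): 416 x 416 input, strides 8 / 16 / 32.
INPUT_SIZE = 416
STRIDES = [8, 16, 32]

groups = group_anchors(ANCHORS)
total_boxes = 0
for stride, anchors in zip(STRIDES, groups):
    n = INPUT_SIZE // stride              # grid size N at this scale
    total_boxes += n * n * len(anchors)   # 3 boxes per grid cell
    print(f"{n}x{n} grid, anchors {anchors}")
print(total_boxes)  # 10647
```

Under these assumptions the three grids are 52 × 52, 26 × 26 and 13 × 13, with the small anchors on the finest grid where small objects are easiest to localise.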

Feature Extractor: A new network is introduced for feature extraction. It extends Darknet-19, used in YOLOv2, with residual shortcut connections and has 53 convolutional layers, hence the name Darknet-53.
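
As a rough check on the layer count, the sketch below tallies the convolutions. The stage layout (1, 2, 8, 8 and 4 residual blocks, each made of a 1 × 1 and a 3 × 3 convolution, with a stride-2 convolution before each stage) is assumed from the Darknet-53 table in the original paper and is not stated in the text above.

```python
# Residual blocks per stage, assumed from the Darknet-53 table in the paper.
RESIDUAL_BLOCKS = [1, 2, 8, 8, 4]

def count_conv_layers(blocks):
    stem = 1                     # initial 3x3 convolution
    downsample = len(blocks)     # one stride-2 3x3 convolution before each stage
    residual = 2 * sum(blocks)   # each residual block = 1x1 conv + 3x3 conv
    return stem + downsample + residual

convs = count_conv_layers(RESIDUAL_BLOCKS)
print(convs)      # 52
print(convs + 1)  # 53 when the final connected layer is counted as well
```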

Fig 1: Darknet-53, from the original paper
Fig 2: Performance comparison across different detection algorithms, from the original paper

Notes: The original paper also describes several ideas the authors tried that did not work; they are worth going through.

Recommended Reading:

https://arxiv.org/pdf/1804.02767.pdf
