Day 97(DL) — Simple Online And Real-time Tracking

We’ll explore one of the research papers that explains how to leverage Kalman filters for real-time object tracking.

SORT Architecture: A simple framework that applies Kalman filtering in image space and uses the Hungarian method for frame-to-frame data association, with an association metric that measures bounding box overlap. The technique is deliberately simple, yet it achieves high frame rates. It follows the tracking-by-detection paradigm for MOT (Multiple Object Tracking), in which objects are detected in every frame and represented as bounding boxes.

The architecture is designed for online tracking, so only detections from the previous and current frames are available to the tracker. MOT can be viewed as a data association problem, where the objective is to link detections across the frames of a video.

Fig1 — comparison of various MOT models — original paper

The core idea behind this technique is to use only the position and size of the bounding box for both motion estimation and data association; appearance features beyond the detection step are ignored. Occlusions (both short-term and long-term) are also ignored for simplicity.

Motion is predicted by the Kalman filter, while the Hungarian method handles data association. The approach is evaluated on pedestrian tracking across different environments, but thanks to the flexibility of CNN-based detectors it can be generalized to other object classes. The principal components are listed below.

  • Detection
  • Propagating object states into future frames (Kalman filter)
  • Associating current detections with existing objects (Hungarian algorithm)
  • Managing the lifespan of the tracked objects

Detection: Faster RCNN is used as the object detector. It is a two-stage detector: the first stage, a region proposal network (RPN), identifies regions of interest (ROIs), and the second stage (the head) classifies each proposed region and outputs the bounding box coordinates, confidence scores and class probabilities.

Two backbone networks are compared: ZF Net and VGG16. Faster RCNN is used with pre-trained weights learned on the PASCAL VOC dataset. The choice of detector has a significant influence on tracker performance.

Estimation Model: This is the representation and motion model used to propagate a target's identity into the next frame. The inter-frame displacement of each object is approximated by a linear constant velocity model that is independent of other objects and of camera motion. The state of each target is denoted as,

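$x = [u, v, s, r, \dot{u}, \dot{v}, \dot{s}]^T$, where u and v represent the horizontal and vertical pixel location of the centre of the bounding box, s and r represent its scale (area) and aspect ratio, and $\dot{u}$, $\dot{v}$, $\dot{s}$ are the corresponding velocities. The aspect ratio is treated as constant, so it has no velocity term.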
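Detectors usually return boxes as corner coordinates, so a conversion to the (u, v, s, r) form is needed before the filter can use them. The helpers below are a minimal sketch of that conversion; the function names `bbox_to_z` and `z_to_bbox`, the `[x1, y1, x2, y2]` input layout, and the choice of width/height for the aspect ratio are assumptions made for illustration.

```python
import math


def bbox_to_z(bbox):
    """Convert a corner-format box [x1, y1, x2, y2] to [u, v, s, r]."""
    x1, y1, x2, y2 = bbox
    w = x2 - x1
    h = y2 - y1
    u = x1 + w / 2.0  # horizontal centre
    v = y1 + h / 2.0  # vertical centre
    s = w * h         # scale, taken as the box area
    r = w / h         # aspect ratio (assumed here to be width / height)
    return [u, v, s, r]


def z_to_bbox(z):
    """Convert [u, v, s, r] back to a corner-format box [x1, y1, x2, y2]."""
    u, v, s, r = z
    w = math.sqrt(s * r)  # since s = w * h and r = w / h, s * r = w ** 2
    h = s / w
    return [u - w / 2.0, v - h / 2.0, u + w / 2.0, v + h / 2.0]
```

For example, a box with corners (10, 20) and (50, 100) is 40 wide and 80 tall, so its centre is (30, 60), its area is 3200 and its aspect ratio is 0.5.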

Scenario 1 (a detection is associated with the target): The detected bounding box is used to update the target state, and the velocity components are solved optimally within the Kalman filter framework.

Scenario 2 (no detection is associated with the target): The state is simply predicted forward using the linear velocity model, without correction.
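The two scenarios map onto the predict and update steps of a standard linear Kalman filter. The following is a rough sketch using NumPy, not the authors' implementation: the class name `ConstantVelocityKF` and all noise values in P, Q and R are illustrative assumptions. The state is the 7-dimensional vector (position, area, aspect ratio, and three velocities), and the measurement is the 4-dimensional (u, v, s, r).

```python
import numpy as np


class ConstantVelocityKF:
    """Constant velocity Kalman filter over [u, v, s, r, du, dv, ds]."""

    def __init__(self, z0):
        self.x = np.zeros(7)
        self.x[:4] = z0                      # start at the first measurement
        self.F = np.eye(7)                   # state transition matrix
        self.F[0, 4] = self.F[1, 5] = self.F[2, 6] = 1.0  # u += du, v += dv, s += ds
        self.H = np.eye(4, 7)                # we only observe u, v, s, r
        # Noise settings below are assumed values, not taken from the paper.
        self.P = np.eye(7) * 10.0
        self.P[4:, 4:] *= 100.0              # velocities are unknown at first
        self.Q = np.eye(7) * 0.01            # process noise
        self.R = np.eye(4)                   # measurement noise

    def predict(self):
        """Propagate the state one frame ahead (used in both scenarios)."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4]

    def update(self, z):
        """Correct the state with an associated detection (Scenario 1 only)."""
        y = np.asarray(z, dtype=float) - self.H @ self.x   # innovation
        S = self.H @ self.P @ self.H.T + self.R            # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)           # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(7) - K @ self.H) @ self.P
```

In each frame `predict()` is called for every target. If a detection is matched to the target, `update()` follows; if not, the predicted state is kept as is, which is exactly the uncorrected linear extrapolation of Scenario 2.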

Data Association: For each existing target, the bounding box is predicted at its new location in the current frame. The assignment cost matrix is the IOU between each detection and every predicted box, and the Hungarian algorithm solves the assignment optimally so that identities are passed on to the best-matching detections. A minimum IOU threshold is also imposed: any assignment whose overlap falls below it is rejected.
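The association step can be sketched with NumPy and SciPy's `linear_sum_assignment`, which solves the same linear assignment problem that the Hungarian method addresses. This is an illustrative sketch rather than the paper's code: the function names `iou_matrix` and `associate`, the corner-format boxes, and the default threshold of 0.3 are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou_matrix(dets, preds):
    """Pairwise IOU between detections (D, 4) and predicted boxes (T, 4)."""
    d = np.asarray(dets, dtype=float)[:, None, :]   # shape (D, 1, 4)
    p = np.asarray(preds, dtype=float)[None, :, :]  # shape (1, T, 4)
    x1 = np.maximum(d[..., 0], p[..., 0])
    y1 = np.maximum(d[..., 1], p[..., 1])
    x2 = np.minimum(d[..., 2], p[..., 2])
    y2 = np.minimum(d[..., 3], p[..., 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_d = (d[..., 2] - d[..., 0]) * (d[..., 3] - d[..., 1])
    area_p = (p[..., 2] - p[..., 0]) * (p[..., 3] - p[..., 1])
    return inter / (area_d + area_p - inter)


def associate(dets, preds, iou_min=0.3):
    """Match detections to predicted target boxes.

    Returns (matches, unmatched_detections, unmatched_targets), where
    matches is a list of (detection_index, target_index) pairs.
    """
    if len(dets) == 0 or len(preds) == 0:
        return [], list(range(len(dets))), list(range(len(preds)))
    iou = iou_matrix(dets, preds)
    # The solver minimises cost, so negate IOU to maximise total overlap.
    rows, cols = linear_sum_assignment(-iou)
    # Reject assignments whose overlap is below the minimum IOU.
    matches = [(int(d), int(t)) for d, t in zip(rows, cols) if iou[d, t] >= iou_min]
    matched_d = {d for d, _ in matches}
    matched_t = {t for _, t in matches}
    unmatched_dets = [d for d in range(len(dets)) if d not in matched_d]
    unmatched_trks = [t for t in range(len(preds)) if t not in matched_t]
    return matches, unmatched_dets, unmatched_trks
```

Matched detections pass on the identity of their target, while unmatched detections and unmatched targets are left to the track lifespan management step.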

