Object detectors such as the R-CNN family (two-stage) and the YOLO family (one-stage) are widely used in computer vision tasks such as object identification for autonomous driving and object tracking. The general working principle is to fit each object inside a bounding box and classify the boxes into categories. For instance, given a picture, we would like to detect all the pedestrians (class 1) and the cars (class 2). In this scenario we have two classes, and each candidate box is retained or discarded based on its confidence (whether the box contains an object) and its class score.
The second step in object detection is post-processing, which uses procedures such as non-maximum suppression (NMS): overlapping boxes are discarded based on their IoU (Intersection over Union). A high IoU between two boxes indicates that they most likely describe the same object, so only the higher-scoring one is kept. The bounding-box pipeline is therefore fairly lengthy, involving multiple checks (both preprocessing and postprocessing) before arriving at the intended output.
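To make the post-processing step concrete, here is a minimal NumPy sketch of IoU and greedy NMS. It is an illustration rather than the code of any particular detector: the function names, the (x1, y1, x2, y2) box format, and the 0.5 threshold are all assumptions chosen for the example.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    # Clip at zero so non-overlapping boxes get an intersection of 0.
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much, repeat."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

# Two heavily overlapping boxes and one separate box (illustrative values).
boxes = np.array([[10, 10, 50, 50],
                  [12, 12, 52, 52],
                  [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.82, so it is suppressed, while the third box does not overlap anything and survives.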
In this research paper, a competitive alternative is proposed in which each object is represented by a single point, the centre of its bounding box. Depending on the use case, the remaining properties such as object size, dimension, orientation and pose are retrieved from the image features at the centre point. Instead of identifying bounding boxes, object detection becomes a matter of locating keypoints.
The basic process of the new approach: the input image is passed to a CNN that generates a heatmap. Peaks (high-intensity values) in the heatmap denote the object centres. The image features at these peaks are used to predict the bounding box height and width of the respective objects. Inference is a single forward pass, with no NMS post-processing step. The architecture is called ‘CenterNet’.
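The decoding described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the paper's implementation: a peak is taken to be a cell that is at least as large as its 8 neighbours, the 0.3 score threshold is arbitrary, `size_map` is a hypothetical array holding a predicted (width, height) in input pixels for every cell, and no sub-pixel correction of the centre is applied.

```python
import numpy as np

def find_peaks(heatmap, threshold=0.3):
    """Return (row, col, score) for cells >= all 8 neighbours and above threshold."""
    h, w = heatmap.shape
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    # Stack the 9 shifted views of the 3x3 neighbourhood and take their maximum.
    shifted = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    local_max = np.stack(shifted).max(axis=0)
    is_peak = (heatmap == local_max) & (heatmap > threshold)
    rows, cols = np.nonzero(is_peak)
    return [(int(r), int(c), float(heatmap[r, c])) for r, c in zip(rows, cols)]

def decode_boxes(heatmap, size_map, stride=4, threshold=0.3):
    """Combine peaks with per-cell (width, height) into boxes in input pixels."""
    boxes = []
    for r, c, score in find_peaks(heatmap, threshold):
        w, h = (float(v) for v in size_map[r, c])
        cx, cy = float(c * stride), float(r * stride)  # back to input resolution
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, score))
    return boxes

# Toy single-class heatmap with two centres and one weaker neighbour.
heatmap = np.zeros((8, 8))
heatmap[1, 6] = 0.8
heatmap[5, 3] = 0.9
heatmap[5, 4] = 0.5  # next to a stronger cell, so not a peak
size_map = np.zeros((8, 8, 2))
size_map[1, 6] = (8, 8)
size_map[5, 3] = (20, 30)

boxes = decode_boxes(heatmap, size_map)
print(boxes)
```

Each object yields exactly one peak, so no overlapping candidates need to be removed afterwards.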
The new method is somewhat similar to one-stage object detection, where learning starts from a set of anchors; here, each centre point plays the role of a single anchor. Some noteworthy features of the keypoint method:
- Anchor assignment is based only on location, unlike the box-overlap assignment used in one-stage object detectors
- No threshold is explicitly set to separate foreground from background
- Only one positive anchor is associated with each object, which eliminates the need for NMS
- A high output resolution is used (output stride of 4), in contrast to standard detectors that use an output stride of 16 (low resolution)
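A small sketch can illustrate why location-only assignment benefits from the finer stride. The helper `center_cell` and the box coordinates below are hypothetical, and mapping a centre to a grid cell by integer division with the stride is an assumption made for this example.

```python
def center_cell(box, stride):
    """Map a box (x1, y1, x2, y2) in input pixels to the output cell holding its centre."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return int(cx // stride), int(cy // stride)

box_a = (100, 100, 120, 140)  # centre at (110, 120)
box_b = (96, 100, 112, 144)   # centre at (104, 122)

# At stride 4 the two objects land in different cells.
print(center_cell(box_a, 4), center_cell(box_b, 4))    # (27, 30) (26, 30)
# At stride 16 both centres collapse into the same cell.
print(center_cell(box_a, 16), center_cell(box_b, 16))  # (6, 7) (6, 7)
```

With one positive location per object, two nearby objects can only be told apart if their centres fall in different cells, which the finer grid makes more likely.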
Objects as Points
Hyperparameters: Let’s consider an image I with height ‘H’, width ‘W’ and 3 channels. The objective is to generate a keypoint heatmap with values in [0,1] and size W/R x H/R x C, where ‘R’ is the output stride and ‘C’ is the number of keypoint types. The value of ‘C’ depends on the problem at hand. For example, C=17 for human pose estimation, corresponding to the human joints, whereas C=80 for object detection, covering the object categories.
The output stride, usually R=4, is the factor by which the input image is downsampled to produce the output. A predicted value of Y=1 indicates a keypoint, whereas Y=0 indicates background.
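As a quick check of these shapes, the sketch below builds a W/R x H/R x C array and marks each keypoint's cell with 1, leaving everything else at 0. The function name, the 512 x 384 input size, the two-class setup, and the [x, y, class] index order are assumptions for illustration; only the exact centre cell is marked, which is a simplification.

```python
import numpy as np

def make_target_heatmap(width, height, keypoints, num_classes, stride=4):
    """Build a (W/R, H/R, C) array: 1 at each keypoint's cell, 0 for background."""
    heat = np.zeros((width // stride, height // stride, num_classes), dtype=np.float32)
    for x, y, c in keypoints:
        # Integer division maps an input-pixel location to its output cell.
        heat[int(x // stride), int(y // stride), c] = 1.0
    return heat

# Two keypoints as (x, y, class index) in input pixels.
keypoints = [(110, 120, 0), (300, 200, 1)]
target = make_target_heatmap(512, 384, keypoints, num_classes=2)

print(target.shape)       # (128, 96, 2)
print(target[27, 30, 0])  # 1.0
```

With R=4, a 512 x 384 input gives a 128 x 96 grid per class, and each object contributes a single cell with value 1.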
The next step is to understand the loss function and training, which we’ll cover in the subsequent post.