Object Detection Techniques in Deep Learning

Kirthan s May 01 2021 · 4 min read


1. R-CNN

                                                                       RCNN Architecture

Region Proposal. Generate and extract category-independent region proposals, e.g. candidate bounding boxes.

Feature Extractor. Extract features from each candidate region, e.g. using a deep convolutional neural network.

Classifier. Classify features as one of the known classes, e.g. with a linear SVM classifier model.

The Region-based Convolutional Network method (R-CNN) is a combination of region proposals with Convolutional Neural Networks (CNNs). R-CNN helps in localising objects with a deep network and training a high-capacity model with only a small quantity of annotated detection data. It achieves excellent object detection accuracy by using a deep ConvNet to classify object proposals. R-CNN has the capability to scale to thousands of object classes without resorting to approximate techniques such as hashing.

A computer vision technique called “selective search” is used to propose candidate regions, or bounding boxes, of potential objects in the image, although the flexibility of the design allows other region proposal algorithms to be used.

The feature extractor used by the model was the AlexNet deep CNN that won the ILSVRC-2012 image classification competition. The output of the CNN was a 4,096-element vector describing the contents of the region, which is fed to a linear SVM for classification; specifically, one SVM is trained for each known class.
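The three-stage pipeline above can be sketched end to end. This is an illustrative NumPy sketch, not a real implementation: `propose_regions` stands in for selective search, and `extract_features` fakes the AlexNet embedding with simple crop statistics; only the shapes (4,096-d features, one linear SVM per class) mirror R-CNN.

```python
import numpy as np

rng = np.random.default_rng(0)

def propose_regions(image, n_proposals=2000):
    """Stand-in for selective search: random candidate boxes (x1, y1, x2, y2)."""
    h, w = image.shape[:2]
    x1 = rng.integers(0, w - 1, size=n_proposals)
    y1 = rng.integers(0, h - 1, size=n_proposals)
    x2 = rng.integers(x1 + 1, w, size=n_proposals)  # guarantee x2 > x1
    y2 = rng.integers(y1 + 1, h, size=n_proposals)
    return np.stack([x1, y1, x2, y2], axis=1)

def extract_features(image, boxes, dim=4096):
    """Stand-in for the AlexNet backbone: each crop -> a 4,096-d vector.
    A real R-CNN warps the crop to a fixed size and runs it through a CNN;
    here we fake a deterministic embedding from simple crop statistics."""
    feats = np.zeros((len(boxes), dim))
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        crop = image[y1:y2, x1:x2]
        feats[i, 0] = crop.mean()
        feats[i, 1] = crop.std()
    return feats

def score_proposals(features, svm_weights, svm_biases):
    """One linear SVM per class: returns scores of shape (n_boxes, n_classes)."""
    return features @ svm_weights.T + svm_biases

image = rng.random((256, 256))
boxes = propose_regions(image, n_proposals=100)
features = extract_features(image, boxes)
n_classes = 20
W = rng.standard_normal((n_classes, 4096))  # placeholder per-class SVM weights
b = np.zeros(n_classes)
scores = score_proposals(features, W, b)
best_class = scores.argmax(axis=1)  # one predicted class per proposal
```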

2. Fast R-CNN

                                                   Fast RCNN Architecture

R-CNN has three notable drawbacks, which Fast R-CNN was designed to address:

Training is a multi-stage pipeline. It involves the preparation and operation of three separate models.

Training is expensive in space and time. Training a deep CNN on so many region proposals per image is very slow.

Object detection is slow. Making predictions with a deep CNN on so many region proposals is very slow.

Fast R-CNN is proposed as a single model instead of a pipeline to learn and output regions and classifications directly.

The architecture of the model takes the photograph and a set of region proposals as input, which are passed through a deep convolutional neural network. A pre-trained CNN, such as VGG-16, is used for feature extraction. At the end of the deep CNN is a custom layer called the Region of Interest Pooling layer, or RoI pooling, that extracts features specific to a given input candidate region.

The output of the CNN is then interpreted by a fully connected layer, after which the model bifurcates into two outputs: one for the class prediction via a softmax layer, and another with a linear output for the bounding box. This process is repeated for each region of interest in a given image.
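RoI pooling itself is easy to sketch in plain NumPy: divide each region into a fixed output grid and max-pool within each sub-window, so every proposal yields a fixed-size feature regardless of its shape. The coordinates and sizes below are illustrative, not tied to any real network.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    """RoI max pooling: divide the RoI into an output_size grid and take the
    max activation inside each sub-window.

    feature_map: (C, H, W) array; roi: (x1, y1, x2, y2) in feature-map coords.
    """
    c = feature_map.shape[0]
    x1, y1, x2, y2 = roi
    oh, ow = output_size
    # Sub-window boundaries, rounded to integer coordinates.
    ys = np.linspace(y1, y2, oh + 1).astype(int)
    xs = np.linspace(x1, x2, ow + 1).astype(int)
    out = np.zeros((c, oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Ensure each sub-window covers at least one cell.
            y_lo, y_hi = ys[i], max(ys[i + 1], ys[i] + 1)
            x_lo, x_hi = xs[j], max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = feature_map[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return out

fmap = np.arange(64, dtype=float).reshape(1, 8, 8)  # one channel, 8x8 map
pooled = roi_pool(fmap, roi=(0, 0, 4, 4), output_size=(2, 2))
# Each pooled cell is the max of one 2x2 sub-window of the 4x4 RoI.
```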

3. Faster R-CNN

                                                       Faster RCNN Architecture 

Faster R-CNN is an object detection algorithm similar to R-CNN. It utilises a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, making it more cost-effective than R-CNN and Fast R-CNN. A Region Proposal Network is basically a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position, and is trained end-to-end to generate high-quality region proposals, which are then used by Fast R-CNN for object detection.

The architecture was designed to both propose and refine region proposals as part of the training process, referred to as a Region Proposal Network, or RPN. These regions are then used in concert with a Fast R-CNN model in a single model design. These improvements both reduce the number of region proposals and accelerate the test-time operation of the model to near real-time with then state-of-the-art performance.

Although it is a single unified model, the architecture comprises two modules:

Region Proposal Network. Convolutional neural network for proposing regions and the type of object to consider in the region.

Fast R-CNN. Convolutional neural network for extracting features from the proposed regions and outputting the bounding box and class labels.

The RPN works by taking the output of a pre-trained deep CNN, such as VGG-16, passing a small network over the feature map, and outputting multiple region proposals and a class prediction for each. Region proposals are bounding boxes based on so-called anchor boxes, pre-defined shapes designed to accelerate and improve the proposal of regions. The class prediction is binary, indicating the presence of an object or not: the so-called “objectness” of the proposed region.
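Anchor generation is straightforward to sketch. The NumPy sketch below is illustrative, assuming the commonly cited configuration of 3 scales × 3 aspect ratios over a stride-16 backbone; it simply enumerates the 9 anchor boxes centred at every feature-map position.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate RPN anchor boxes (x1, y1, x2, y2) in image coordinates,
    one per (scale, ratio) pair at every feature-map position."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Centre of this feature-map cell, mapped back to the image.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # Width/height chosen so the anchor area stays ~ s*s
                    # while the aspect ratio w/h equals r.
                    w = s * np.sqrt(r)
                    h = s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

# A 600x800 image with a VGG-16 stride of 16 gives a ~37x50 feature map,
# hence 9 anchors at each of its positions.
anchors = generate_anchors(37, 50)
```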


4. YOLO (You Only Look Once)

                                                              YOLO detection 

It turns out that YOLO (You Only Look Once) is much more accurate and faster than a sliding-window approach. It is based on only a minor tweak on top of algorithms we already know: divide the image into a grid of cells, then change the labels of our data so that both localization and classification are performed for each grid cell.

The model works by first splitting the input image into a grid of cells, where each cell is responsible for predicting a bounding box if the center of that bounding box falls within it. Each grid cell predicts a bounding box in terms of the x, y coordinates, the width and height, and a confidence. Each cell also makes a class prediction.

For example, an image may be divided into a 7×7 grid and each cell in the grid may predict 2 bounding boxes, resulting in 98 proposed bounding box predictions. The class probability map and the bounding boxes with confidences are then combined into a final set of bounding boxes and class labels. The image taken from the paper summarizes the two outputs of the model.
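Decoding that grid of predictions can be sketched as follows. The tensor layout (S × S × (B·5 + C), with the S=7, B=2, C=20 settings above) is the commonly described YOLOv1 convention, and `decode_yolo` is a hypothetical helper; the key step is converting per-cell offsets back to image-relative coordinates.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes

def decode_yolo(pred, conf_thresh=0.5):
    """Turn a (S, S, B*5 + C) prediction tensor into a list of boxes.

    Per box: (x, y) offsets within the cell, (w, h) relative to the whole
    image, plus a confidence; each cell also carries C class probabilities.
    """
    boxes = []
    for i in range(S):        # grid row
        for j in range(S):    # grid column
            class_id = int(pred[i, j, B * 5:].argmax())
            for b in range(B):
                x, y, w, h, conf = pred[i, j, b * 5:(b + 1) * 5]
                if conf < conf_thresh:
                    continue
                # Cell offsets -> image-relative centre coordinates.
                cx, cy = (j + x) / S, (i + y) / S
                boxes.append((cx, cy, w, h, float(conf), class_id))
    return boxes

pred = np.zeros((S, S, B * 5 + C))
pred[3, 3, 0:5] = [0.5, 0.5, 0.2, 0.3, 0.9]  # one confident box in cell (3, 3)
pred[3, 3, B * 5 + 7] = 1.0                   # that cell predicts class 7
detections = decode_yolo(pred)
```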

5. Single Shot Detector (SSD)

                                                                        SSD Architecture 

Single Shot Detector (SSD) is a method for detecting objects in images using a single deep neural network. The SSD approach discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. The network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes.
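Generating the default boxes for one feature map can be sketched in a few lines. The scales and aspect ratios below are illustrative, not the paper's exact configuration; the point is that coarser feature maps are paired with larger scales, so each resolution covers a different object-size range.

```python
import numpy as np

def default_boxes(feat_size, scale, ratios=(1.0, 2.0, 0.5)):
    """Default boxes (cx, cy, w, h), image-relative, for one SSD feature map.

    feat_size: the feature map is feat_size x feat_size; scale: box scale
    relative to the image; one box per aspect ratio at each position.
    """
    boxes = []
    for i in range(feat_size):
        for j in range(feat_size):
            cx = (j + 0.5) / feat_size  # centre of this feature-map cell
            cy = (i + 0.5) / feat_size
            for r in ratios:
                w = scale * np.sqrt(r)  # keep area ~ scale^2, ratio w/h = r
                h = scale / np.sqrt(r)
                boxes.append([cx, cy, w, h])
    return np.array(boxes)

# Coarser maps get larger scales (values here are illustrative).
all_boxes = np.concatenate([
    default_boxes(38, 0.1),   # large map, small boxes
    default_boxes(19, 0.2),
    default_boxes(10, 0.38),  # small map, large boxes
])
```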

Advantages of SSD:

· SSD completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. 

· Easy to train and straightforward to integrate into systems that require a detection component. 

· SSD has competitive accuracy with methods that utilize an additional object proposal step, and it is much faster while providing a unified framework for both training and inference.

Happy Learning!!
