##deeplearning ##computervision ##objectdetection ##featureextraction ##fastfeaturepyramid

rakesh baral Mar 26 2021 · 2 min read

This article is based on the paper ‘Fast Feature Pyramids for Object Detection’ by Piotr Dollár et al.

An object detection algorithm should be as accurate as possible without giving up speed. One way to get there is to modify the existing feature extraction pipeline so that a small loss in accuracy buys a large gain in computation speed. This is what the Fast Feature Pyramid does. The whole idea rests on the following observation from the paper:

‘Multi-resolution image features may be approximated via extrapolation from nearby scales, rather than being computed explicitly.’

Following this idea, we can avoid computing gradients over a finely sampled image pyramid. The image gradient measures the directional change in intensity at each pixel, so it gives a gradient magnitude and a gradient angle for every pixel of the image. The magnitude and direction are calculated with the formula below.

[Polar co-ordinates magnitude and orientation calculation in pixel-level]

Here gx and gy are the changes in pixel value in the horizontal and vertical directions.
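As a minimal sketch of this step, the snippet below computes the per-pixel magnitude sqrt(gx² + gy²) and orientation atan2(gy, gx) with NumPy. The function name `gradient_magnitude_orientation` and the use of `np.gradient` (central differences) are illustrative choices, not necessarily the filter used in the paper.

```python
import numpy as np

def gradient_magnitude_orientation(image):
    """Return per-pixel gradient magnitude and orientation (in radians)."""
    # np.gradient returns the derivative along axis 0 (rows, vertical) first,
    # then along axis 1 (columns, horizontal).
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)       # sqrt(gx**2 + gy**2)
    orientation = np.arctan2(gy, gx)   # angle in [-pi, pi]
    return magnitude, orientation

# A horizontal ramp: intensity grows by 1 per column and is constant down each column.
ramp = np.tile(np.arange(5.0), (4, 1))
mag, ori = gradient_magnitude_orientation(ramp)
```

For the ramp, every pixel has gx = 1 and gy = 0, so the magnitude is 1 everywhere and the orientation is 0.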

For an upsampled image, the sums of gradient magnitudes in the original and the upsampled image should be related by a factor of about k, where k is the degree of upsampling. Upsampling spreads each original pixel value over the corresponding smaller pixels, so the gradient at each location becomes about 1/k as steep while the number of pixels grows by k², which gives a net factor of k.
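The factor of k can be checked numerically. The sketch below is an assumption-laden stand-in for real upsampling: instead of resampling a natural image, it samples the same smooth synthetic scene (a hypothetical `sample_scene` function) on a grid that is k times finer, then compares the sums of gradient magnitudes.

```python
import numpy as np

def sample_scene(n):
    """Sample a smooth synthetic scene on an n x n grid over the unit square."""
    t = np.arange(n) / n
    x, y = np.meshgrid(t, t)
    return np.sin(2 * np.pi * 3 * x) + np.cos(2 * np.pi * 2 * y)

def gradient_magnitude_sum(image):
    """Sum of per-pixel gradient magnitudes."""
    gy, gx = np.gradient(image)
    return np.hypot(gx, gy).sum()

k = 2
base = gradient_magnitude_sum(sample_scene(64))
fine = gradient_magnitude_sum(sample_scene(64 * k))
ratio = fine / base   # expected to be close to k for a smooth scene
```

On real images the ratio is only approximately k, since interpolation and image noise both affect the gradients.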

When an image is downsampled, information is lost, and the loss is found to be consistent across images. As a result, the measured gradients undershoot the extrapolated gradients. In other words, the predicted and actual values differ by a nearly constant factor.

[The upsampled and downsampled images produce approximately the same histogram of gradients as the original, up to a scale factor. Each bin is about 2x the original bin for the upsampled image and about 0.34x for the downsampled image, the latter reflecting the consistent loss that occurs while downsampling.]


To account for this, the paper applies a power law that holds for natural images, given below.

[Numerator and denominator are the same function of a particular channel applied on the pixel level of the same image at scales s1 and s2.]

λΩ can be estimated from training observations and is constant for a particular channel type. E is the error term, which captures the deviation from the power law.
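In the paper's notation the power law reads f_Ω(I_s1) / f_Ω(I_s2) = (s1/s2)^(-λΩ) + E. Taking logarithms turns it into a straight line through the origin, so one plausible way to estimate λΩ is a least-squares fit in log space. The sketch below does this on synthetic ratios generated with a made-up exponent of 0.11; the data, the noise model, and the function name `estimate_lambda` are assumptions for illustration, not values or code from the paper.

```python
import numpy as np

def estimate_lambda(scale_ratios, feature_ratios):
    """Least-squares fit of lam in feature_ratio ~= scale_ratio ** (-lam)."""
    # In log space: log(feature_ratio) = -lam * log(scale_ratio),
    # a line through the origin with slope -lam.
    log_s = np.log2(scale_ratios)
    log_f = np.log2(feature_ratios)
    slope = np.sum(log_s * log_f) / np.sum(log_s ** 2)
    return -slope

# Synthetic measurements, NOT data from the paper.
rng = np.random.default_rng(0)
scale_ratios = 2.0 ** np.linspace(-2, 2, 17)
true_lambda = 0.11                                   # made-up exponent
noise = np.exp(rng.normal(0.0, 0.01, scale_ratios.size))
feature_ratios = scale_ratios ** (-true_lambda) * noise
lam = estimate_lambda(scale_ratios, feature_ratios)
```

With real images, `feature_ratios` would come from channel statistics measured on training images at pairs of scales.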

Let Is denote the image I captured at scale s, and let R(I, s) denote I resampled by s. Suppose we have computed C = Ω(I); can we predict the channel image Cs = Ω(Is) at a new scale s using only C?

The standard approach is to compute Cs = Ω(R(I, s)), ignoring the information contained in C = Ω(I). Instead, the paper proposes the following approximation, derived from the power law:

[Function applied for multi scaling derived from Power Law]
[Architectural level difference between Standard and Proposed approach]
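The approximation in the paper has the form Cs ≈ R(C, s) · s^(-λΩ): resample the already computed channel and rescale its values. The sketch below illustrates it under simplifying assumptions. `resample_nearest` is a hypothetical nearest-neighbour stand-in for R, whereas a practical implementation would use a smoother resampling such as bilinear interpolation, and the value of `lam` is made up.

```python
import numpy as np

def resample_nearest(channel, s):
    """Hypothetical stand-in for R(C, s): nearest-neighbour resampling by factor s."""
    h, w = channel.shape
    new_h, new_w = max(1, round(h * s)), max(1, round(w * s))
    # Map each output index back to the nearest source index at or below it.
    rows = np.minimum((np.arange(new_h) / s).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / s).astype(int), w - 1)
    return channel[np.ix_(rows, cols)]

def approximate_channel(channel, s, lam):
    """Approximate the channel at scale s from C alone: R(C, s) * s**(-lam)."""
    return resample_nearest(channel, s) * s ** (-lam)

C = np.ones((8, 8))                       # a constant channel, for easy checking
C_half = approximate_channel(C, 0.5, lam=0.1)
```

For the constant channel, the result is a 4 x 4 array whose entries all equal 0.5^(-0.1) = 2^0.1, which shows the rescaling step in isolation.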

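Since the error term grows as the two scales move further apart, the approximation should only be used between nearby scales. The channels are therefore computed explicitly at octave-spaced scales (the image resampled by factors of 2, 4, 8, 16, etc.), and the scales in between are approximated from the nearest octave, which gives the best tradeoff between speed and accuracy. For an n × n image, computing the channels at one scale per octave costs roughly n² + n²/4 + n²/16 + ... = 4/3 n², only about a third more than computing features at a single scale.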
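The cost argument can be checked by counting pixels. The sketch below is a rough model that assumes feature cost is proportional to the number of pixels and ignores the much cheaper resampling and rescaling used for the approximated scales. The function name `pyramid_cost` and the image size are illustrative.

```python
def pyramid_cost(n, scales_per_octave, num_octaves):
    """Total pixels processed when features are computed explicitly at every scale."""
    total = 0.0
    for i in range(scales_per_octave * num_octaves):
        s = 2.0 ** (-i / scales_per_octave)   # scale relative to the original image
        total += (n * s) ** 2                 # an n x n image shrinks to (n*s) x (n*s)
    return total

n = 512
single = n ** 2
octaves_only = pyramid_cost(n, scales_per_octave=1, num_octaves=8)
dense = pyramid_cost(n, scales_per_octave=8, num_octaves=8)
octave_ratio = octaves_only / single   # geometric series 1 + 1/4 + 1/16 + ...
dense_ratio = dense / single
```

Under this model the octave-only pyramid costs about 4/3 of a single scale, while explicitly computing 8 scales per octave costs a little over 6 times a single scale.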

In terms of speed, the quoted figures are about 100 fps for standard feature computation at a single scale, about 8 fps when a pyramid with 8 scales per octave is computed explicitly, and about 50 fps for the fast pyramid built with the power law.

The technique is easier to understand from the image below.

[Octaves are computed explicitly and nearby scales are approximated from them to maintain the tradeoff.]
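The layout of such a pyramid can be sketched as a simple plan: for each scale, record whether it is computed explicitly (an octave) or approximated, and which octave it would be approximated from. The function `plan_pyramid` and its tuple layout are hypothetical, and choosing the nearest octave by rounding in log space is an assumption of this sketch.

```python
def plan_pyramid(num_octaves, scales_per_octave):
    """Return (scale, source_scale, is_exact) for each pyramid level."""
    plan = []
    for i in range(num_octaves * scales_per_octave + 1):
        scale = 2.0 ** (-i / scales_per_octave)
        # Nearest octave in log space (Python rounds exact ties to the even index).
        nearest_octave = round(i / scales_per_octave)
        source = 2.0 ** (-nearest_octave)
        is_exact = (i % scales_per_octave == 0)   # octave scales are computed explicitly
        plan.append((scale, source, is_exact))
    return plan

plan = plan_pyramid(num_octaves=2, scales_per_octave=4)
exact_scales = [scale for scale, _, is_exact in plan if is_exact]
```

Here only the scales 1.0, 0.5 and 0.25 are computed explicitly; the other six levels would be approximated from their `source_scale` using the relative scale `scale / source_scale`.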


Improvements in the performance of visual recognition systems in the past decade have in part come from the realization that finely sampled pyramids of image features provide a good front-end for image analysis. It is widely believed that the price to be paid for improved performance is sharply increased computational costs.

In the context of object detection, the fast pyramid technique can be applied to several algorithms, such as Aggregated Channel Features (ACF), Integral Channel Features (ICF), and Deformable Part Models (DPM). It is also not restricted to object detection and can be applied to various other visual recognition algorithms.
