In this article, we will look at what Convolutional Neural Networks (ConvNets for short) are.

In deep learning, a convolutional neural network (CNN, or **ConvNet**) is a class of deep neural networks, most commonly applied to analyzing visual imagery.

The input to a CNN is an image or, more specifically, a 3D matrix of height, width, and depth (the number of channels).

### Now let's see the Convolutional Neural Network

A Convolutional Neural Network is usually built from three types of layers:

**Convolutional Layer (CONV)**

**Pooling Layer (POOL)**

**Fully Connected Layer (FC)**

Now let's look at each of them in detail.

### Convolutional Layer

The Convolutional Layer is the first layer in a CNN.

It gets as input a matrix of the dimensions [h1 * w1 * d1], which is the blue matrix in the above image.

Next, we have **kernels** (filters).

A **kernel** is a matrix with the dimensions [h2 * w2 * d1], which is one yellow cuboid of the multiple cuboids (kernels) stacked on top of each other (in the kernels layer) in the above image.

Each convolutional layer has d2 such kernels stacked on top of each other, so the layer's weights together have dimensions [h2 * w2 * d1 * d2], where d2 is the number of kernels.

For each kernel, we have its respective bias, which is a scalar quantity.

And then, we have an output for this layer, the green matrix which has dimensions [h3 * w3 * d2].
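The relationship between these dimensions can be made concrete with a small calculation. The sketch below is only an illustration: it assumes two parameters the text has not introduced, a stride and a zero-padding amount (the hypothetical arguments `stride` and `padding`), and uses the standard formula h3 = (h1 + 2 * padding - h2) / stride + 1. The function names are made up for this example.

```python
def conv_output_shape(h1, w1, h2, w2, d2, stride=1, padding=0):
    """Return (h3, w3, d2) for an [h1 * w1 * d1] input and d2 kernels of size h2 * w2.

    `stride` and `padding` are assumptions, not values taken from the article.
    """
    h3 = (h1 + 2 * padding - h2) // stride + 1
    w3 = (w1 + 2 * padding - w2) // stride + 1
    return (h3, w3, d2)


def conv_param_count(h2, w2, d1, d2):
    """Each of the d2 kernels has h2 * w2 * d1 weights plus one scalar bias."""
    return d2 * (h2 * w2 * d1 + 1)


# A 32x32 RGB image (d1 = 3) convolved with ten 5x5x3 kernels.
print(conv_output_shape(32, 32, 5, 5, 10))  # (28, 28, 10)
print(conv_param_count(5, 5, 3, 10))        # 760
```

Note that the output depth depends only on the number of kernels, not on the input depth d1.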

Alright, so we have inputs, kernels, and outputs. Now let’s look at what happens with a 2D input and a 2D kernel.

First, we need to agree on a few parameters that define a convolutional layer.

**For each position of the kernel on the image, each number in the kernel is multiplied by the corresponding number in the input matrix (blue matrix), and the products are summed to give the value at the corresponding position in the output matrix (green matrix).**

With d1 > 1, the same thing happens for each channel. The per-channel results are added together, the bias of the respective filter is added to that sum, and the result is the value at the corresponding position of the output matrix.
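The multiply-and-sum described above can be sketched in plain Python. This is a minimal illustration, not an efficient implementation: it assumes a stride of 1 and no padding (the text specifies neither) and uses nested lists instead of a tensor library. `conv2d` is a hypothetical helper name.

```python
def conv2d(image, kernels, biases):
    """Convolution with stride 1 and no padding, on nested lists.

    image:   [h1][w1][d1]
    kernels: d2 kernels, each [h2][w2][d1]
    biases:  d2 scalars, one per kernel
    returns: [h3][w3][d2] with h3 = h1 - h2 + 1 and w3 = w1 - w2 + 1
    """
    h1, w1, d1 = len(image), len(image[0]), len(image[0][0])
    h2, w2 = len(kernels[0]), len(kernels[0][0])
    h3, w3 = h1 - h2 + 1, w1 - w2 + 1

    output = []
    for i in range(h3):
        row = []
        for j in range(w3):
            cell = []
            for kernel, bias in zip(kernels, biases):
                total = bias  # start from the kernel's bias
                for a in range(h2):
                    for b in range(w2):
                        for c in range(d1):  # sum over all input channels
                            total += image[i + a][j + b][c] * kernel[a][b][c]
                cell.append(total)
            row.append(cell)
        output.append(row)
    return output


# 3x3 single-channel image, one 2x2 kernel of ones (sums its window), bias 1.
image = [[[1], [2], [3]],
         [[4], [5], [6]],
         [[7], [8], [9]]]
kernels = [[[[1], [1]],
            [[1], [1]]]]
print(conv2d(image, kernels, [1]))  # [[[13], [17]], [[25], [29]]]
```

The top-left output value is 1 + 2 + 4 + 5 = 12, plus the bias of 1, which gives 13.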

**Let’s visualize!**

### Pooling Layer

There are many types of pooling; the two most basic are Max Pooling and Average Pooling.

The main purpose of a pooling layer is to reduce the spatial size of the input tensor, and with it the number of parameters and computations in the layers that follow. As a result, it:

- Helps reduce overfitting

- Extracts representative features from the input tensor

- Reduces computation and thus aids efficiency

The input to the Pooling layer is a tensor.

In the case of Max Pooling, a kernel of size `n*n` (2x2 in the above example) is moved across the matrix, and for each position the max value is taken and put in the corresponding position of the output matrix.

In the case of Average Pooling, a kernel of size `n*n` is moved across the matrix, and for each position the average of all the values is taken and put in the corresponding position of the output matrix.

This is repeated for each channel in the input tensor. And so we get the output tensor.

**Pooling downsamples the image in its height and width but the number of channels(depth) stays the same.**
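Both pooling variants can be sketched with one small function. This is only an illustration under one assumption: the window moves in steps of `n`, so windows do not overlap (the text gives the window size but not the step). `pool2d` is a hypothetical helper name, and the input is a nested list of shape [h][w][d].

```python
def pool2d(tensor, n, mode="max"):
    """Pool an [h][w][d] nested list with a non-overlapping n*n window.

    A step equal to n is an assumption; the article only gives the window size.
    """
    h, w, d = len(tensor), len(tensor[0]), len(tensor[0][0])
    output = []
    for i in range(0, h - n + 1, n):
        row = []
        for j in range(0, w - n + 1, n):
            cell = []
            for c in range(d):  # each channel is pooled independently
                window = [tensor[i + a][j + b][c]
                          for a in range(n) for b in range(n)]
                if mode == "max":
                    cell.append(max(window))
                elif mode == "average":
                    cell.append(sum(window) / len(window))
                else:
                    raise ValueError("mode must be 'max' or 'average'")
            row.append(cell)
        output.append(row)
    return output


# 4x4 input with a single channel.
x = [[[1], [3], [2], [4]],
     [[5], [6], [1], [2]],
     [[7], [2], [9], [1]],
     [[3], [4], [5], [6]]]
print(pool2d(x, 2, "max"))      # [[[6], [4]], [[7], [9]]]
print(pool2d(x, 2, "average"))  # [[[3.75], [2.25]], [[4.0], [5.25]]]
```

The 4x4 input becomes 2x2 in height and width, while the number of channels is unchanged.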

After the above operations, the resulting matrices are flattened so that they can be passed to a fully connected layer.

**What is flattening?** *The output from the final (and any) Pooling or Convolutional Layer is a 3-dimensional matrix; to flatten it is to unroll all its values into a vector.*
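As a small sketch, flattening a nested list of shape [h][w][d] can be written as a single comprehension. The row-by-row order used here is an assumption; any fixed order works as long as it is applied consistently. `flatten` is a hypothetical helper name.

```python
def flatten(tensor):
    """Unroll an [h][w][d] nested list into a flat list of h * w * d values."""
    return [value for row in tensor for cell in row for value in cell]


t = [[[1, 2], [3, 4]],
     [[5, 6], [7, 8]]]  # shape 2 x 2 x 2
print(flatten(t))  # [1, 2, 3, 4, 5, 6, 7, 8]
```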

Let's visualize

### Fully Connected Layer

The Fully Connected Layer is simply a feed-forward neural network. Fully Connected Layers form the last few layers in the network and can include one or more hidden layers.

The input to the fully connected layer is the output from the final Pooling or Convolutional Layer, which is flattened and then fed into the fully connected layer.

After the input has passed through the fully connected layers, the final layer uses the softmax activation function (instead of ReLU), which converts its outputs into the probabilities of the input belonging to each class (classification).

And so finally, we have the probabilities of the object in the image belonging to the different classes!!
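The fully connected stage can be sketched as follows. This is a minimal illustration with made-up weights: a flattened input of 4 values, one hidden layer of 3 units with ReLU, and 2 output classes with softmax. None of these sizes or values come from the article, and `dense`, `relu`, and `softmax` are hypothetical helper names.

```python
import math


def dense(inputs, weights, biases):
    """One fully connected layer: out[k] = sum_i inputs[i] * weights[k][i] + biases[k]."""
    return [sum(x * w for x, w in zip(inputs, row)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Replace negative values with zero."""
    return [max(0.0, v) for v in values]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical weights, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0]                 # flattened input
w_hidden = [[0.1, 0.0, -0.1, 0.2],
            [0.0, 0.3, 0.1, -0.2],
            [-0.5, 0.1, 0.0, 0.1]]
b_hidden = [0.0, 0.1, 0.0]
w_out = [[1.0, -1.0, 0.5],
         [-1.0, 1.0, 0.5]]
b_out = [0.0, 0.0]

hidden = relu(dense(x, w_hidden, b_hidden))
probs = softmax(dense(hidden, w_out, b_out))
print(probs)  # two probabilities, one per class, summing to 1
```

The class with the largest probability is taken as the predicted label.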

*And that is how the Convolutional Neural Network works, and how input images get classified with labels!*

### Case studies

There are several architectures in the field of Convolutional Networks that have a name. The most common are:

**LeNet**. The first successful applications of Convolutional Networks were developed by Yann LeCun in the 1990’s. Of these, the best known is the LeNet architecture that was used to read zip codes, digits, etc.

**AlexNet**. The first work that popularized Convolutional Networks in Computer Vision was the AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. The AlexNet was submitted to the ImageNet ILSVRC challenge in 2012 and significantly outperformed the second runner-up (top 5 error of 16% compared to runner-up with 26% error). The Network had very similar architecture to LeNet, but was deeper, bigger, and featured Convolutional Layers stacked on top of each other (previously it was common to only have a single CONV layer always immediately followed by a POOL layer).

**ZF Net**. The ILSVRC 2013 winner was a Convolutional Network from Matthew Zeiler and Rob Fergus. It became known as the ZF Net (short for Zeiler & Fergus Net). It was an improvement on AlexNet by tweaking the architecture hyperparameters, in particular by expanding the size of the middle convolutional layers and making the stride and filter size on the first layer smaller.

**GoogLeNet**. The ILSVRC 2014 winner was a Convolutional Network from Szegedy et al. from Google. Its main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M). Additionally, this paper uses Average Pooling instead of Fully Connected layers at the top of the ConvNet, eliminating a large number of parameters that do not seem to matter much. There are also several followup versions to the GoogLeNet, most recently Inception-v4.

**VGGNet**. The runner-up in ILSVRC 2014 was the network from Karen Simonyan and Andrew Zisserman that became known as the VGGNet. Its main contribution was in showing that the depth of the network is a critical component for good performance. Their final best network contains 16 CONV/FC layers and, appealingly, features an extremely homogeneous architecture that only performs 3x3 convolutions and 2x2 pooling from the beginning to the end. Their trained model is available for plug and play use in Caffe. A downside of the VGGNet is that it is more expensive to evaluate and uses a lot more memory and parameters (140M). Most of these parameters are in the first fully connected layer, and it has since been found that these FC layers can be removed with no loss in performance, significantly reducing the number of necessary parameters.
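A rough calculation shows why the first fully connected layer dominates. The shapes below are assumptions not stated in this article: in the commonly cited 16-layer configuration with a 224x224 input, the last pooling layer outputs a 7x7x512 tensor and the first FC layer has 4096 units. `fc_param_count` is a hypothetical helper name.

```python
def fc_param_count(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


# Assumed shapes (not from the article): 7x7x512 flattened input, 4096 units.
first_fc = fc_param_count(7 * 7 * 512, 4096)
print(first_fc)  # 102764544
```

Under these assumptions, that single layer holds about 103M of the roughly 140M parameters quoted above.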

**ResNet**. The Residual Network developed by Kaiming He et al. was the winner of ILSVRC 2015. It features special skip connections and heavy use of batch normalization. The architecture is also missing fully connected layers at the end of the network. The reader is also referred to Kaiming’s presentation (video, slides), and some recent experiments that reproduce these networks in Torch. ResNets are currently by far the state of the art Convolutional Neural Network models and are the default choice for using ConvNets in practice (as of May 10, 2016). In particular, also see more recent developments that tweak the original architecture from Kaiming He et al.: Identity Mappings in Deep Residual Networks (published March 2016).

**If you have any concerns or want to contact me, you can comment down below or reach me on LinkedIn.**