As the article's heading suggests, let's waste no more time and dive straight into understanding the **Activation Function**.

While building a neural network, one of the mandatory choices to make is which activation function to use. In fact, it is an unavoidable choice, because activation functions are the foundation that lets neural networks learn and approximate any kind of complex relationship between variables.

### What is "Activation Function"?

The **activation function** decides whether a neuron should be activated or not by calculating the weighted sum of its inputs and adding a bias to it. The purpose of the **activation function** is to introduce *non-linearity* into the output of a neuron.

*The roles and responsibilities of the activation function are to normalize, restrict, non-linearize, or filter the data.*

### Can we do without an Activation Function?

We understand that using an activation function introduces an additional step at each layer during the forward propagation. Now the question is – if the activation function increases the complexity so much, can we do without an activation function?

Imagine a neural network without activation functions. In that case, every neuron would only perform a linear transformation on the inputs using the weights and biases. Although linear transformations make the neural network simpler, *such a network would be less powerful and would not be able to learn complex patterns from the data.*

A neural network without an activation function is essentially just a linear regression model.
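The claim above can be checked directly: stacking linear layers without an activation in between collapses into a single linear layer. The weights below are arbitrary illustrative values.

```python
def linear_layer(x, w, b):
    # One neuron with no activation: a plain linear transformation.
    return w * x + b

# Two stacked linear layers: y = w2 * (w1*x + b1) + b2
w1, b1 = 2.0, 1.0
w2, b2 = 3.0, -0.5

def two_layers(x):
    return linear_layer(linear_layer(x, w1, b1), w2, b2)

# Algebraically equivalent single layer: w = w2*w1, b = w2*b1 + b2
w, b = w2 * w1, w2 * b1 + b2

def one_layer(x):
    return linear_layer(x, w, b)

# The "deep" network and the single layer agree everywhere.
for x in [-1.0, 0.0, 2.5]:
    assert abs(two_layers(x) - one_layer(x)) < 1e-12
```

No matter how many such layers we stack, the result is still `y = Wx + b` for some combined `W` and `b`, which is exactly a linear regression model.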

### Activation Function Types

### Linear Function -

*y = mx + c, where the slope m corresponds to W (weights) and the intercept c to b (bias) in neural nets, so the equation can be rewritten as y = Wx + b.*

**Pros**

**Cons**

### Binary Step Function -

The Binary Step Function is widely known as the "**Threshold Function**".

This activation function is best used to classify inputs, such as distinguishing pictures of cats (so fluffy!) from birds. However, it should only be used at the output nodes of the neural network, not in the hidden layers.
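The threshold behaviour can be sketched in a few lines (a minimal illustration; the default threshold of 0 is an arbitrary choice here):

```python
def binary_step(x, threshold=0.0):
    # Fires (outputs 1) if the weighted sum reaches the threshold,
    # otherwise stays silent (outputs 0).
    return 1 if x >= threshold else 0

# Example: a weighted sum above the threshold activates the neuron.
assert binary_step(0.7) == 1
assert binary_step(-0.3) == 0
```

Because the output jumps abruptly between 0 and 1, the function has zero gradient everywhere it is defined, which is why it is unsuitable for hidden layers trained by backpropagation.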

**Pros**

**Cons**

### Non - Linear Function -

The graph of a linear function is a line, so the graph of a nonlinear function is not a line. Linear functions have a constant slope, whereas nonlinear functions have a slope that varies between points. In general, nonlinear functions have the opposite properties of a linear function.

### Different Types of Non-Linear Functions

### 1. Sigmoid (Logistic) Activation Function

This S-shaped function has proven to work well with two- and three-layer neural networks, particularly for classification problems. Notice the hill-shaped derivative of the function, which pushes the network to "move down the hill" to either side, giving the network more distinction when classifying.
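The sigmoid and its hill-shaped derivative can be sketched directly from the standard formulas (a minimal illustration using only the standard library):

```python
import math

def sigmoid(x):
    # Squashes any real input into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    # The "hill": s * (1 - s), peaking at 0.25 when x = 0 and
    # vanishing for large |x| -- the root of the vanishing-gradient
    # problem mentioned below.
    s = sigmoid(x)
    return s * (1.0 - s)
```

For inputs far from zero the derivative is nearly zero, so gradients passed backward through many sigmoid layers shrink toward nothing.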

**Pros**

**Cons**

Not *"zero-centric"*: the output is restricted to 0 < output < 1, so gradient updates tend to go too far in one direction, and this makes optimization harder.

**Vanishing Gradient Problem.**

### 2. Tanh Activation Function

Tanh is a modified version of the sigmoid activation function and shares similar properties with it.

**Pros**

**Cons**

**Vanishing gradient and Exploding gradient problem.**

Tanh is preferred over the sigmoid function since it is zero-centered and the gradients are not restricted to move in a certain direction.
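The relationship to the sigmoid can be made precise: tanh is a rescaled sigmoid, `tanh(x) = 2*sigmoid(2x) - 1`, which shifts the output range from (0, 1) to the zero-centered (-1, 1). A quick check:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# tanh is a stretched and shifted sigmoid: same S-shape,
# but zero-centered output in (-1, 1).
x = 0.7
assert abs(math.tanh(x) - (2 * sigmoid(2 * x) - 1)) < 1e-12
```

Because the shape is still sigmoidal, the derivative still vanishes for large |x|, which is why tanh inherits the vanishing-gradient problem.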

### 3. ReLU Activation Function (ReLU - Rectified Linear Unit)

**Pros**

**Cons**

Not **zero-centric**: the output is always ≥ 0, so gradient updates tend to move in the same direction, which makes optimization harder. In addition, neurons whose inputs are always negative output a constant 0 and stop learning (the "dying ReLU" problem).
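ReLU itself is the simplest of the nonlinear activations: it passes positive inputs through unchanged and zeroes out negative ones.

```python
def relu(x):
    # max(0, x): identity for positive inputs, 0 for negative ones.
    # Cheap to compute and gradient is 1 for all positive inputs,
    # which avoids the vanishing gradient of sigmoid/tanh there.
    return max(0.0, x)

assert relu(3.0) == 3.0
assert relu(-3.0) == 0.0
```

The hard zero on the negative side is what the leaky variant below is designed to soften.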

### 4. Leaky ReLU Activation Function

This is an attempt to fix the dying ReLU problem: for negative inputs the gradient becomes a small value, **α**, instead of 0.
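A minimal sketch (the default α = 0.01 is a commonly used value, not prescribed by this article):

```python
def leaky_relu(x, alpha=0.01):
    # Negative inputs get a small slope alpha instead of a hard zero,
    # so some gradient keeps flowing and units are less likely to "die".
    return x if x > 0 else alpha * x

assert leaky_relu(5.0) == 5.0
assert leaky_relu(-5.0) == -0.05
```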

**Pros**

**Cons**

### 5. ELU (Exponential Linear Units) Activation Function

**Pros**

**Cons**
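The article does not spell out the ELU formula; the standard definition is identity for positive inputs and a smooth exponential saturation toward -α for negative ones, which can be sketched as:

```python
import math

def elu(x, alpha=1.0):
    # ELU: x for x > 0, alpha * (e^x - 1) for x <= 0.
    # The negative branch is smooth and bounded below by -alpha,
    # pushing the mean activation closer to zero than ReLU does.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

assert elu(2.0) == 2.0
```

Unlike leaky ReLU's straight negative slope, the exponential branch saturates, at the cost of computing an exponential for negative inputs.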

### 6. PReLU (Parametric ReLU) Activation Function

Instead of multiplying x by a fixed constant, we can multiply it by **α**, a trainable parameter, which often works better than leaky ReLU. This extension of leaky ReLU is known as Parametric ReLU.
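A minimal sketch of the idea that α is learned rather than fixed (the class and its method names are illustrative, and the starting α = 0.25 is an arbitrary choice):

```python
class PReLU:
    # Like leaky ReLU, but the negative slope alpha is a learnable
    # parameter updated by gradient descent rather than a constant.
    def __init__(self, alpha=0.25):
        self.alpha = alpha

    def forward(self, x):
        return x if x > 0 else self.alpha * x

    def grad_alpha(self, x):
        # d(output)/d(alpha): 0 for positive inputs, x for negative
        # ones -- this is the gradient the optimizer uses to train alpha.
        return 0.0 if x > 0 else x
```

During training, `alpha` would be nudged by the optimizer just like any weight, letting the network choose its own negative slope.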

**Pros**

**Cons**

### 7. Swish Activation Function

Experiments show that Swish tends to work better than ReLU on deeper models across a number of challenging datasets.

The curve of the Swish function is smooth and the function is differentiable at all points. This is helpful during the model optimization process and is considered to be one of the reasons that swish outperforms ReLU.

It works well for both positive and negative inputs.
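Swish is defined as `x * sigmoid(βx)`; with β = 1 (the commonly used form) it can be sketched as:

```python
import math

def swish(x, beta=1.0):
    # x * sigmoid(beta * x): smooth everywhere and, unlike ReLU,
    # slightly non-monotonic for small negative inputs.
    return x / (1.0 + math.exp(-beta * x))

assert swish(0.0) == 0.0
```

For large positive x it behaves like the identity (sigmoid approaches 1), and for large negative x it approaches 0, giving a smooth ReLU-like shape with no hard kink at zero.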

### 8. Softmax / Normalized Exponential Function

**Softmax can be described as a combination of multiple sigmoid functions.**

The softmax function is also a type of sigmoid function, but it is very useful for handling multi-class classification problems.

"The softmax function returns the probability of a data point belonging to each individual class."
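A minimal sketch of softmax over a list of raw scores ("logits"); subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the mathematical definition:

```python
import math

def softmax(logits):
    # Exponentiate each score (after shifting by the max so large
    # scores don't overflow), then normalize so the outputs sum to 1,
    # i.e. a probability distribution over the classes.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
assert abs(sum(probs) - 1.0) < 1e-9
```

The largest logit always gets the largest probability, which is why softmax is the usual choice for the output layer in multi-class classification.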

### 9. Softplus Activation Function

The softplus function is similar to the ReLU function, but it is relatively smoother. The Softplus (or SmoothReLU) function is **f(x) = ln(1 + exp x)**.

The derivative of the softplus function, f′(x), is the logistic function **1/(1 + exp(−x))**.

The function's value ranges over (0, +∞).
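Both formulas can be sketched directly (`math.log1p(y)` computes `ln(1 + y)` accurately for small `y`):

```python
import math

def softplus(x):
    # Smooth approximation of ReLU: ln(1 + e^x), always positive.
    return math.log1p(math.exp(x))

def softplus_derivative(x):
    # The derivative is exactly the logistic sigmoid 1 / (1 + e^(-x)).
    return 1.0 / (1.0 + math.exp(-x))

# At 0, softplus is ln(2) where ReLU would have its sharp kink.
assert abs(softplus(0.0) - math.log(2.0)) < 1e-12
```

For large positive x, softplus(x) ≈ x, matching ReLU; for large negative x it decays smoothly toward 0 instead of being exactly 0.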

**Cons**

### 10. Maxout Activation Function

The **Maxout activation** is a generalization of the ReLU and the leaky ReLU **functions**.

It is a learnable **activation function**.

It is a piecewise linear **function** that returns the maximum of its inputs.
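The generalization can be sketched as follows: each "piece" is a learned linear function of the input, and Maxout outputs the largest one. ReLU falls out as the special case with the two fixed pieces `0` and `x` (the weight/bias pairs below are illustrative):

```python
def maxout(x, weight_bias_pairs):
    # Returns the maximum over several linear pieces w*x + b.
    # In a real network, each (w, b) pair is learned by training.
    return max(w * x + b for w, b in weight_bias_pairs)

# ReLU recovered as a two-piece maxout: max(0*x + 0, 1*x + 0).
relu_pieces = [(0.0, 0.0), (1.0, 0.0)]
assert maxout(3.0, relu_pieces) == 3.0
assert maxout(-3.0, relu_pieces) == 0.0
```

Leaky ReLU is likewise the special case with pieces `αx` and `x`, which is what makes Maxout a generalization of both.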

### Which one is better to use? How to choose the right one?

To be honest, there is no hard and fast rule for choosing an activation function.

Each activation function has its own pros and cons.

What works well and what doesn't will be decided by your own trials and experiments.

If you have any concerns or want to get in touch, you can comment down below or contact me on LinkedIn.