10 Activation Functions Every Data Scientist Should Know About

#deeplearning #activationfunction

Sanket Kangle Sept 18 2020 · 7 min read
Image from author

What is an activation function?

In the simplest terms, an activation function is just a mathematical function that receives an input, performs some predefined mathematical operation on it, and produces a result as output. The term ‘activation’ comes from the fact that the output of these functions determines whether a neuron is active or not. Activation functions are also used for normalizing data, for regularization, and for introducing non-linearity into a neural network.

The following are some important activation functions every data scientist should be aware of:

1. Sigmoid function

The sigmoid function has a characteristic S-shaped curve; it is bounded, has a non-negative derivative at every point, and has exactly one inflection point.

Image from author

In the image above, the red curve is the sigmoid function and the green curve is its derivative.

Mathematical function:

f(x) = 1 / (1 + exp(-x))

From the function above, it is evident that exp(-x) is always positive, which means the denominator is always greater than 1. Hence, the value of the function f(x) is always positive but less than 1.

Acceptable input:

A real number ranging from -inf to +inf.

Output range:

As x tends to -inf, output tends to 0 and as x tends to +inf, output tends to 1.

Derivative:

The derivative of the sigmoid function is smooth, symmetric about the y-axis, and always positive. In terms of the sigmoid function itself, it is given as f'(x) = f(x)(1 - f(x)).

Pros of Sigmoid Function:

  • Normalizes the data to the range (0, 1)
  • Provides a continuous output that is differentiable everywhere
  • Can be used in the output layer for a clear prediction
  • Good for binary classification problems

Cons of Sigmoid Function:

  • For extreme positive and negative values, the output saturates at 1 and 0 respectively, so performance on extreme data points is poor
  • In backpropagation, it is prone to the vanishing gradient problem
  • It is not a zero-centered function
  • As it is an exponential function, it is computationally expensive
  • It is a common misconception that the sigmoid is a probability function; it is only a probability-like function (the sum of the outputs over all inputs is not necessarily 1)
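
To make this concrete, here is a minimal NumPy sketch of the sigmoid and its derivative (the function names are my own, not from the article):

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + exp(-x)); the output always lies in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # f'(x) = f(x) * (1 - f(x)); always positive, largest at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))             # saturates near 0 and 1 at the extremes
print(sigmoid_derivative(x))  # the gradient vanishes for large |x|
```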

2. Tanh/Hyperbolic tangent

It is similar to the sigmoid function, except that it normalizes the data to the range (-1, 1) instead of (0, 1).

Image from author

In the image above, the red curve is the tanh function and the green curve is its derivative.

Mathematical function:

f(x) = tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))

Acceptable input:

A real number ranging from -inf to +inf.

Output range:

As x tends to -inf, output tends to -1 and as x tends to +inf, output tends to 1.

Derivative:

The derivative of tanh is f'(x) = 1 - tanh²(x); like the sigmoid's derivative, it is smooth and always positive.

Pros of tanh function:

  • It is a zero-centered function
  • It normalizes all data to the range (-1, 1)
  • For binary classification, a combination of tanh at the input layer and sigmoid at the output layer works well

Cons of tanh function:

  • It also suffers from the vanishing gradient problem
  • tanh is also computationally expensive, as it involves exponentials
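
As a quick illustration, here is a minimal sketch of tanh and its derivative, relying on NumPy's built-in np.tanh (the helper name is my own):

```python
import numpy as np

def tanh_derivative(x):
    # d/dx tanh(x) = 1 - tanh(x)**2
    return 1.0 - np.tanh(x) ** 2

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(np.tanh(x))          # zero-centered outputs in (-1, 1)
print(tanh_derivative(x))  # the gradient still vanishes for large |x|
```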

3. ReLU : Rectified Linear Unit

In the normalization process of the sigmoid and tanh functions, some information about the magnitude of the variables tends to be lost. To tackle this problem, ReLU was introduced.

Mathematical function:

f(x) = max(0, x)

Image from author

Acceptable input:

A real number ranging from -inf to +inf.

Output range:

For all negative numbers, the output is zero; for positive numbers, the output is the same as the input.

Derivative:

For all negative numbers, the derivative is 0 and for positive numbers, the derivative is 1. It is a step function, as shown in the figure below.

Image from author

Pros of ReLU function:

  • It is computationally cheap, as it is a very simple function
  • Does not have the vanishing gradient problem of the tanh and sigmoid functions
  • Can give a true 0 output, which the sigmoid cannot
  • It converges to the minimum faster than sigmoid and tanh

Cons of ReLU function:

  • Output is 0 for all negative values
  • Not a zero-centered function
  • Does not have a smooth derivative throughout its range
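
A minimal sketch of ReLU and its gradient, assuming NumPy; treating the gradient at exactly x = 0 as 0 is my own convention, since the derivative is not defined there:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 for x > 0, 0 for x <= 0 (the value at exactly 0 is a convention)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # negative inputs are clipped to 0
print(relu_grad(x))  # step-shaped gradient
```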

There are many variants of ReLU; some of them are discussed below.

4. Leaky ReLU : Leaky Rectified Linear Unit

Instead of discarding negative inputs altogether, leaky ReLU provides a small output for them too.

Mathematical function:

For negative numbers, instead of zero, leaky ReLU gives an output that is 0.01 times the input; for positive numbers, it gives an output equal to the input.

Image from author

Acceptable input:

A real number ranging from -inf to +inf.

Output range:

From -inf to +inf: negative inputs are scaled down to 0.01 times their value, while positive inputs pass through unchanged.

Derivative:

For all negative numbers, the derivative is 0.01 and for positive numbers, the derivative is 1. It is a step function, as shown in the figure below.

Image from author
Note: the derivative for x < 0 should be 0.01, which by mistake is shown as -0.01 in the figure.

Pros of Leaky ReLU:

  • Inexpensive to compute, same as ReLU
  • Provides a non-zero output for negative values as well

Cons of Leaky ReLU:

  • Not a zero-centered function
  • The derivative is undefined at the single point x = 0
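
A minimal sketch of leaky ReLU with the 0.01 factor described above (the parameter name slope is mine):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # x for x >= 0, slope * x for x < 0
    return np.where(x >= 0, x, slope * x)

def leaky_relu_grad(x, slope=0.01):
    # 1 for x >= 0, slope for x < 0
    return np.where(x >= 0, 1.0, slope)

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
print(leaky_relu(x))       # negative inputs are scaled by 0.01, not discarded
print(leaky_relu_grad(x))  # a small but non-zero gradient on the negative side
```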

5. P-ReLU : Parametric Rectified Linear Unit

For negative inputs, a learnable parameter 'a' is used instead of the fixed 0.01 factor. A closely related variant in which this factor is chosen randomly is called randomized ReLU.

Mathematical function:

f(x) = x for x >= 0 and f(x) = a·x for x < 0

for a = 0, it is ReLU
for a = 0.01, it is leaky ReLU
"a" is a learnable parameter

Acceptable input:

A real number ranging from -inf to +inf.

Output range:

From -inf to +inf for a > 0; the negative side is scaled by 'a'.

Derivative:

It is similar to leaky ReLU, except that the slope of the gradient for negative inputs changes with the value of 'a'.

Pros of P-ReLU:

  • On top of leaky ReLU, the magnitude of the output for negative inputs can be controlled through 'a'

Cons of P-ReLU:

  • Same as ReLU and leaky ReLU
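
A minimal sketch of P-ReLU; in a real framework 'a' would be a trainable parameter updated by backpropagation, but here it is simply passed in as a number:

```python
import numpy as np

def prelu(x, a):
    # x for x >= 0, a * x for x < 0; 'a' is learned during training
    return np.where(x >= 0, x, a * x)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(prelu(x, a=0.0))   # a = 0    -> plain ReLU
print(prelu(x, a=0.01))  # a = 0.01 -> leaky ReLU
print(prelu(x, a=0.25))  # any other learned value of 'a'
```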

6. ELU : Exponential Linear Unit

ReLU, leaky ReLU, and P-ReLU have a sharp corner in their curve at zero. To get a smoother curve around zero, ELU comes in handy.

Mathematical function:

f(x) = x for x > 0 and f(x) = α(exp(x) - 1) for x <= 0, where α is a positive constant (commonly 1)

Image from author

Acceptable input:

A real number ranging from -inf to +inf.

Output range:

From -α to +inf: as x tends to -inf, the output tends to -α; for positive inputs, the output is unbounded.

Derivative:

Image from author

Pros of ELU:

  • The curve of the function is smoother around 0 than the other ReLU variants
  • It can provide negative outputs as well

Cons of ELU:

  • For negative inputs, it is computationally expensive, as it is exponential in that range
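
A minimal sketch of ELU, assuming the common formulation with a scale alpha (alpha = 1 here):

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for x > 0, alpha * (exp(x) - 1) for x <= 0;
    # np.minimum keeps exp() from overflowing on the unused positive branch
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def elu_grad(x, alpha=1.0):
    # 1 for x > 0, alpha * exp(x) for x <= 0: no sharp corner at zero
    return np.where(x > 0, 1.0, alpha * np.exp(np.minimum(x, 0.0)))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(elu(x))       # negative outputs saturate at -alpha
print(elu_grad(x))  # the gradient changes smoothly around 0
```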

7. Softplus

It is an activation function whose graph is similar to that of ReLU but smooth throughout. In the figure below, the red curve is softplus and the blue one is ReLU.

Image from author

Mathematical function:

f(x) = ln(1 + exp(x))

Acceptable input:

A real number ranging from -inf to +inf.

Output range:

From 0 to +inf: the output is always positive, approaching 0 as x tends to -inf and approaching x itself for large positive x.

Derivative:

The derivative of the softplus function is the sigmoid function.

Image from author

Pros of Softplus function:

  • Smoother gradient than ReLU
  • Negative inputs also produce a meaningful output

Cons of Softplus function:

  • It is more computationally expensive than ReLU
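
A minimal sketch of softplus; np.logaddexp(0, x) computes log(1 + exp(x)) in a numerically stable way (helper names are my own):

```python
import numpy as np

def softplus(x):
    # f(x) = log(1 + exp(x)), computed stably as log(exp(0) + exp(x))
    return np.logaddexp(0.0, x)

def softplus_grad(x):
    # the derivative of softplus is the sigmoid function
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(softplus(x))       # always positive; approaches x for large positive x
print(softplus_grad(x))  # identical to sigmoid(x)
```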

8. Swish function

This function was proposed by the Google Brain team. It is a non-monotonic, smooth, self-gated function.

Mathematical function:

f(x) = x · sigmoid(x), i.e. the input gates itself

Image from author

Acceptable input:

A real number ranging from -inf to +inf.

Output range:

Unbounded above; for negative inputs the output dips slightly below zero before tending back to 0 as x tends to -inf.

Derivative:

Image from author

Pros of Swish function:

  • It does not have the vanishing gradient problem of the sigmoid function
  • It has been reported to perform better than ReLU

Cons of Swish function:

  • It is more computationally expensive than ReLU
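
A minimal sketch of swish in its basic self-gated form, f(x) = x · sigmoid(x) (the beta-scaled variant is not shown):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    # the input gates itself: f(x) = x * sigmoid(x)
    return x * sigmoid(x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))  # non-monotonic: dips slightly below zero for negative inputs
```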

9. Maxout function

The name of the maxout function is very intuitive: ‘max + out’, and that is exactly what it does! It selects whichever of its inputs is the maximum and gives it as the output. It is a learnable activation function.

Mathematical function:

f(x) = max(w1·x + b1, w2·x + b2), where each linear unit has its own learnable weights and bias

Acceptable input:

A real number ranging from -inf to +inf.

Output range:

From -inf to +inf.

Derivative:

The gradient is 1 with respect to the linear unit that produced the maximum and 0 with respect to the others.

Pros of Maxout function:

  • Computationally inexpensive
  • It is a learnable activation function
  • Considers only the most dominant input

Cons of Maxout function:

  • The number of parameters to be trained doubles compared with ReLU or leaky ReLU
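
A minimal sketch of a single maxout unit with two linear pieces; the weights here are random stand-ins for parameters that would normally be learned:

```python
import numpy as np

rng = np.random.default_rng(0)

def maxout(x, W, b):
    # each row of W (with its bias) defines one linear unit;
    # the unit outputs the maximum of the linear responses
    return np.max(W @ x + b, axis=0)

x = rng.normal(size=4)       # a single 4-dimensional input
W = rng.normal(size=(2, 4))  # two linear units -> twice the parameters
b = rng.normal(size=2)
print(maxout(x, W, b))       # the larger of the two linear responses
```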

10. Softmax function

The softmax function is a probabilistic function used in the output layer for multi-class classification problems. The sum of its outputs is 1 in this case.

Mathematical function:

f(x_i) = exp(x_i) / Σ_j exp(x_j)

Acceptable input:

A vector of real numbers, each ranging from -inf to +inf.

Output range:

As it gives a probabilistic output, each output is always in the range 0 to 1, and the outputs sum to 1.

Derivative:

For outputs s_i, the derivative is s_i(1 - s_i) with respect to x_i and -s_i·s_j with respect to any other input x_j.

Pros of Softmax function:

  • Useful for the output layer of multi-class classification problems
  • Provides a probabilistic output
  • Normalizes data between zero and one

Cons of Softmax function:

  • Only suitable for the output layer of multi-class classification problems
  • Computationally expensive
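
A minimal sketch of a numerically stable softmax (subtracting the maximum before exponentiating is a standard trick, not something from the article):

```python
import numpy as np

def softmax(x):
    # subtracting the max does not change the result but avoids overflow
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # each value lies in (0, 1)
print(probs.sum())  # the outputs sum to 1
```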

* * *

Thanks for reading the article! Wanna connect with me?
Here is a link to my Linkedin Profile
