Activation Functions: What They Are and What They Do

#deeplearning #basics #regularization #normalization #nonlinearization

Monis Khan Nov 09 2020 · 7 min read

What are activation functions?

Activation functions transform the input of a perceptron in a given layer into its output. Mathematically speaking, if z_j^M and a_j^M are the input and output, respectively, of the j-th perceptron in the M-th layer, then the activation function f() establishes the relation between them in the following manner:

a_j^M = f(z_j^M)          … (i)

In a neural network each layer, except the input layer, can have its own activation function. Equation (i) can then be written as:

a_j^M = f^M(z_j^M)          … (ii)

where f^M() is the activation function of layer M.
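
To make the notation concrete, here is a minimal NumPy sketch (my own illustration, not from the article) of one layer computing z_j^M from the previous layer's outputs and then applying an activation f() element-wise; the sigmoid used here is just one possible choice of f().

```python
import numpy as np

def sigmoid(z):
    """One possible activation f(); squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(a_prev, W, b, activation=sigmoid):
    """One layer: z_j^M = sum_k w_jk^M * a_k^(M-1) + b_j^M, then a_j^M = f(z_j^M)."""
    z = W @ a_prev + b          # pre-activation: the input to each perceptron
    a = activation(z)           # output of each perceptron after the activation
    return a

# Toy example: 3 inputs feeding a layer of 2 perceptrons.
rng = np.random.default_rng(0)
a_prev = rng.normal(size=3)          # outputs of the previous layer
W = rng.normal(size=(2, 3))          # weight matrix of layer M
b = np.zeros(2)                      # biases of layer M
print(layer_forward(a_prev, W, b))   # a_j^M for j = 1, 2
```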

What do activation functions do?

Activation functions serve the following purposes:

  • Regularization of the input
  • Non-linearization of the input data
  • Normalization of the input data
  • Checking whether a given input is relevant to the model's prediction

    Why is regularization needed?

    When you train a neural network on a given dataset, it learns the underlying patterns that describe the relationships among the data points. Some of these are general patterns, while others are specific to the training dataset.

    General patterns are those that would still hold when new data is fed to the network, while patterns specific to the data points of the training dataset are classified as noise.

    Learning noise leads to model overfitting. Further, since our sole aim is to learn the general patterns, learning noise also wastes time and computational resources.

    This is where regularization comes in. It restrains the learning process to the general patterns and prevents noise from being taken into account, i.e. it regulates the learning process, thus keeping overfitting in check.

    How do activation functions perform regularization?

    Now that we have seen the importance of regularization, let us look at how it is achieved in neural networks: by penalizing the weight matrices. As shown in figure 1.1, the input to a perceptron is given by:

    z_j^M = Σ_k w_jk^M · a_k^(M−1) + b_j^M          … (iii)

    By penalizing the weight matrix we mean decreasing the magnitude of the weights, and hence the influence of the data points that contribute noise.

    Now, you may ask: weights are tweaked during backward propagation, so how do activation functions come into the picture? Bear with me, in a few lines we'll find out. The weights are updated according to the following rule, where η is the learning rate and E is the error:

    w_new = w_old − η · ∂E/∂w

    The gradient ∂E/∂w is, via the chain rule, a function of the activation function's derivative. Hence activation functions actively (pun intended) influence the regularization process.
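
As a rough sketch of why ∂E/∂w depends on the activation (a toy single perceptron with a sigmoid and a squared error, chosen purely for illustration and not taken from figure 1.1), note how the activation's derivative f′(z) appears as a factor in the gradient through the chain rule:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)        # f'(z) -- this is where the activation enters the gradient

# One perceptron: a = f(w.x + b), squared error E = 0.5 * (a - y)^2
x = np.array([0.5, -1.2, 0.3])
y = 1.0
w = np.array([0.1, 0.4, -0.2])
b = 0.0
eta = 0.1                        # learning rate

z = w @ x + b
a = sigmoid(z)

# Chain rule: dE/dw = (a - y) * f'(z) * x -- the activation's derivative is a factor.
grad_w = (a - y) * sigmoid_derivative(z) * x
grad_b = (a - y) * sigmoid_derivative(z)

# Gradient-descent update: w_new = w_old - eta * dE/dw
w = w - eta * grad_w
b = b - eta * grad_b
print(w, b)
```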

    Why is non-linearization needed?

    Let us play devil's advocate and see what happens if the activation functions do not render non-linearity, i.e. all the activation functions in a neural network are linear. I am going to take the neural network shown in fig 1.1 to drive home the point. If the activation functions in that network were linear, then equation (ii) could be rewritten as:

    a_j^M = f^M(z_j^M) = α · z_j^M          … (iv)

    where α is a constant. Since f^M() is a linear function, all it does is scale its input by the constant α. Therefore, for j = 1, equation (iv) can be rewritten as:

    a_1^M = α · z_1^M = α · (Σ_k w_1k^M · a_k^(M−1) + b_1^M)          … (v)

    Since applying a linear function to another linear function again yields a linear function, the output y of the neural network represented by fig 1.1 can be written in terms of its input x as:

    y = W′ · x + b′          … (vi)

    where W′ and b′ are constants determined by the weights, biases and the α of every layer.

    As you will have noticed, equation (vi) is the equation of linear regression. Now, you may ask what the objection is if a neural network collapses into a linear regression model. There are three main disadvantages:

  • Back propagation is not possible
  • No effect of multiple layers
  • Lacks abstraction
    Back propagation is not possible: In back propagation we take the derivative of the error with respect to the weights and biases in order to learn new weights and biases and arrive at their optimal values. With a linear activation function, however, the derivative of the activation is a constant that bears no relation to the input, so the gradient cannot tell the network how to adjust its weights and biases beyond fitting the single linear function of equation (vi). Back propagation therefore loses its purpose.

    No effect of multiple layers: This is self-evident from equation (vi). With linear activation functions, irrespective of how many layers your network has, it essentially collapses into a single-layer network, i.e. the output layer acting as a linear function of the input layer, as the short sketch below demonstrates.
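
Here is a quick NumPy check of that collapse (shapes and values are made up for illustration): two layers whose activation is a linear f(z) = α·z produce exactly the same outputs as a single linear layer y = W′·x + b′.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=4)                     # input vector

# Two "layers" with a linear activation f(z) = alpha * z
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)
alpha = 0.7

h = alpha * (W1 @ x + b1)                  # layer 1
y_two_layers = alpha * (W2 @ h + b2)       # layer 2

# The same mapping expressed as a single linear layer: y = W'x + b'
W_eff = alpha * alpha * (W2 @ W1)
b_eff = alpha * alpha * (W2 @ b1) + alpha * b2
y_one_layer = W_eff @ x + b_eff

print(np.allclose(y_two_layers, y_one_layer))   # True -- the extra layer added nothing
```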

    Lacks abstraction: Abstraction is the process of making something easier to understand by ignoring details that are unimportant. Abstraction makes our day-to-day tasks doable. For example, something as simple as checking the time on your watch would become a humongous task if you had to take in every detail of the watch: the length and width of the hands, the intricate internal mechanism with small and large gears interlocking through their teeth, and so on. Instead, you ignore all these unnecessary details and just read the time.

    Similarly, neural networks use abstraction to solve complex problems. Let us take a simple route-finding problem to drive home the point.

    Let us assume that you decided to visit a city in Japan, say city X. Your mission is to reach the district headquarters as soon as you leave the airport. The person who gave you the mission brief, say Mr M, forgot an important detail that had been conveyed to him by a local of city X, say Mr T. Mr T had told Mr M that one only has to follow the road signs, which tell you not only which turn to take to reach the city centre but also how far you are from it. Mr M realized his mistake and told you that, though he didn't remember Mr T's exact advice, he could provide you with a dataset from which you could figure out the route to the city centre. Being a proactive employee and an AI engineer, you readily accepted the challenge of building a model that can accurately predict the route to the city centre. But there is one problem: the dataset is in a language that neither you nor Google Translate understands. You have no clue which column denotes what, and the categorical columns are an absolute nightmare.

    Further, the dataset has umpteen features, ranging from the colour of the houses on the way, traffic information, weather information and street vendors along the route, to the trees planted along the road and the species and age of those trees. You don't know what each column denotes, and on top of that most columns are useless.

    If you use linear regression to build your model, it would take an enormous amount of effort and time in feature selection, model tuning, validation and so on before you arrive at an accurate prediction. If the road-sign data were in the form of images, it would be practically impossible to achieve the desired result.

    On the other hand, if you choose to build a neural network with non-linear activation functions, you will be able to make accurate predictions in a reasonable amount of time and with much more ease.

    As you will have guessed by now, we create neural networks to solve problems with a level of accuracy that other algorithms simply can't deliver, or could deliver only with an excessive amount of time and effort. A linear activation function simply defeats the purpose of a neural network.

    What is normalization & why is it needed?

    Normalization basically means rescaling the data so that it falls in a smaller range. In machine learning, normalization is used to bring the numeric columns of a dataset onto a common scale without distorting the differences between values or losing information. Suppose you are working on a problem where one attribute is measured in kilograms, another in metres and yet another in hours. Owing to the difference in units, they will differ in magnitude: for example, a change of 1 unit in the hours column may correspond to a change of 2000 m in the distance column.
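
    For reference (a standard formula, not spelled out in the original text), the most common such rescaling is min-max normalization, which maps every value of a column into [0, 1]:

    x_new = (x − x_min) / (x_max − x_min)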

    Machine learning algorithms operate on certain basic assumptions, and differences in the magnitude of variation of the input variables can adversely affect their performance. For example, distance-based algorithms and gradient-descent-based algorithms malfunction when input variables are on significantly different scales. By malfunction we mean that the learning is dominated by the variable whose magnitude of variation is the largest.

    Let us elaborate by taking the example of a Euclidean-distance-based algorithm. The distance between two points measured on three variables x, y and z is:

    d = √((x_1 − x_2)² + (y_1 − y_2)² + (z_1 − z_2)²)

    If x varies over thousands of units while y and z vary over only a few, the (x_1 − x_2)² term dwarfs the others, and the variable x dominates the learning owing to its significantly higher magnitude of variation. If x, y and z were brought onto the same scale, they would be given equal importance by the learning algorithm. The same is true for gradient-descent-based algorithms. Neural networks are trained with gradient descent and hence require normalization of their inputs.
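
The toy numbers below (my own, purely illustrative) make the domination concrete: before rescaling, the Euclidean distance between two samples is driven almost entirely by the large-scale variable x; after min-max scaling, x, y and z contribute comparable amounts.

```python
import numpy as np

# Two samples with features (x in metres, y in hours, z in kilograms)
p = np.array([4000.0, 2.0, 70.0])
q = np.array([6000.0, 5.0, 90.0])

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

print(euclidean(p, q))                       # ~2000.1 -- dominated by the x (metres) column

# Min-max scale each feature to [0, 1] using assumed column-wise minimums and maximums
lo = np.array([1000.0, 0.0, 50.0])           # assumed minimums of the dataset
hi = np.array([9000.0, 10.0, 120.0])         # assumed maximums of the dataset
p_scaled = (p - lo) / (hi - lo)
q_scaled = (q - lo) / (hi - lo)

print(euclidean(p_scaled, q_scaled))         # now x, y and z all contribute comparably
```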

    How do activation functions perform normalization?

    Some activation functions rescale their input to values between −1 and 1, or between 0 and 1, thereby bringing all of a layer's outputs onto the same range. It should be noted that not all activation functions perform normalization: while tanh and sigmoid squash every input across the board, ReLU and its variants leave positive input values unchanged.
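
A small sketch of that difference (input values chosen arbitrarily): tanh and sigmoid squash everything into (−1, 1) and (0, 1) respectively, while ReLU leaves positive inputs at whatever scale they arrived.

```python
import numpy as np

z = np.array([-6.0, -2.0, 0.0, 3.0, 50.0])       # pre-activations on very different scales

tanh_out = np.tanh(z)                            # all outputs in (-1, 1)
sigmoid_out = 1.0 / (1.0 + np.exp(-z))           # all outputs in (0, 1)
relu_out = np.maximum(0.0, z)                    # negatives clipped, positives untouched

print(tanh_out)     # [-1.    -0.964  0.     0.995  1.   ] (approximately)
print(sigmoid_out)  # [ 0.002  0.119  0.5    0.953  1.   ] (approximately)
print(relu_out)     # [ 0.     0.     0.     3.    50.   ] -- no rescaling of positives
```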

    How do activation functions check whether an input is relevant to the model's prediction?

    Activation functions do this by deciding whether a perceptron should be activated or not. By activation of a perceptron we mean whether the output of that perceptron is passed on as input to the perceptrons of the next layer.

    Let us take the example of the perceptron from figure 1.1. The perceptron j = 1 of layer M receives the input given by equation (iii), z_1^M = Σ_k w_1k^M · a_k^(M−1) + b_1^M, and produces the output a_1^M = f^M(z_1^M).

    Only when a_1^M is above a certain threshold is the perceptron, j = 1, considered activated and its output passed on. Every activation function has its own characteristic threshold; for ReLU, for instance, that threshold is 0.
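
As a hedged sketch of this gating behaviour (a single made-up perceptron, not the one from figure 1.1): with ReLU the characteristic threshold is 0, so a perceptron whose pre-activation z_1^M is negative passes nothing on to the next layer.

```python
import numpy as np

def relu(z):
    # ReLU's characteristic threshold is 0: below it the perceptron is "switched off"
    return np.maximum(0.0, z)

W_row = np.array([0.5, -1.0, 0.25])     # weights w_1k^M feeding perceptron j = 1
b = -0.1                                # bias b_1^M

for a_prev in (np.array([0.2, 0.9, 0.1]), np.array([1.5, 0.1, 0.4])):
    z = W_row @ a_prev + b              # z_1^M = sum_k w_1k^M * a_k^(M-1) + b_1^M
    a = relu(z)                         # a_1^M -- passed to the next layer only if z > 0
    status = "activated" if a > 0 else "not activated"
    print(f"z = {z:+.3f}, output a = {a:.3f} -> perceptron {status}")
```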
