Prerequisites to understand Neural Machine Translation
1.) Artificial Neural Network Basics
2.) RNN - LSTM or RNN - GRU
3.) Encoder-Decoder
Problems with the Encoder-Decoder
1.) The whole source input sentence has to be compressed into a fixed-length vector. This makes long sentences difficult to handle and leads to loss of information, i.e. loss of context; using Bi-directional RNN-LSTM cells in the encoder is only a partial remedy.
2.) If the test sentences are longer than those in the training corpus, performance degrades as the length of the input sentence increases.
3.) The context of all encoder RNN cells is consolidated, and only the output of the last encoder cell is fed to the decoder's first RNN cell. Therefore the model cannot clearly identify which input words are the most significant ones for predicting each target word.
To overcome these issues, Attention comes into the picture: for each target output word, we try to find out which input words of the given sentence are responsible for generating it.
To solve this issue, researchers came up with an idea (in the 2014 paper referenced below) to add a neural network between the Encoder and the Decoder. This network is responsible for finding the most significant, or focused, words of the input sentence for predicting the target output, which is the concept of Attention.
Generalized Architecture

Generalized representation of how Attention plays a role in Hindi-to-English translation:
Deep Dive into the Encoder-Decoder with Attention Neural Network

In this figure, we can see that there are 4 inputs {X1, X2, X3, X4} fed to Bi-directional LSTM cells, which generate 4 outputs {O1, O2, O3, O4}.
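Below is a minimal PyTorch sketch of such a bi-directional LSTM encoder; the vocabulary size, embedding size and hidden size are toy values assumed for illustration, not taken from the figure.

import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, x):                 # x: (batch, seq_len) token ids
        emb = self.embed(x)               # (batch, seq_len, emb_dim)
        outputs, _ = self.bilstm(emb)     # (batch, seq_len, 2*hidden_dim)
        return outputs                    # one output vector per input word: O1..O4

encoder = BiLSTMEncoder()
X = torch.randint(0, 1000, (1, 4))        # a toy source sentence with 4 tokens X1..X4
O = encoder(X)                            # O[:, j] corresponds to O_{j+1}; shape (1, 4, 128)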
The i-th decoder RNN state s_i is computed as
s_i = f(s_{i-1}, y_{i-1}, c_i)
where s_{i-1} is the previous decoder state, y_{i-1} is the previously generated target word and c_i is the context vector for decoder step i.
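As a rough illustration (assuming a GRU decoder cell and toy dimensions; how the context vector c_i is obtained is described next), one decoder step could look like this:

import torch
import torch.nn as nn

hidden_dim, emb_dim, ctx_dim = 64, 32, 128
decoder_cell = nn.GRUCell(emb_dim + ctx_dim, hidden_dim)   # plays the role of f(.)

s_prev = torch.zeros(1, hidden_dim)       # s_{i-1}: previous decoder state
y_prev = torch.randn(1, emb_dim)          # embedding of the previous target word y_{i-1}
c_i    = torch.randn(1, ctx_dim)          # context vector for step i (computed below)

s_i = decoder_cell(torch.cat([y_prev, c_i], dim=-1), s_prev)   # new decoder state s_i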
The outputs generated by the encoder cells are the inputs to the intermediate neural network, whose trainable weights are the alpha values, denoted {α}. The context vector, denoted {C}, is computed as the weighted sum of the encoder outputs with the trainable alpha values:
c_i = Σ_j α_ij · O_j
where α_ij is the weight (attention) that decoder step i puts on encoder output O_j. This equation looks just like a basic neural network equation: a weighted sum of inputs.
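A small sketch of this weighted sum, with toy shapes and made-up α values just to show the computation (the real α values come from the alignment scores described below):

import torch

O = torch.randn(4, 128)                        # encoder outputs O1..O4
alpha = torch.tensor([0.1, 0.6, 0.2, 0.1])     # attention weights for one decoder step (sum to 1)
c_i = (alpha.unsqueeze(-1) * O).sum(dim=0)     # c_i = sum_j alpha_ij * O_j  -> shape (128,)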
Concept to be Cleared :
In Figure 3, we can see that for computing the context vector C2, the output of X2, denoted O2, has its alpha weight nullified because O2 does not play any significant role; only the remaining outputs receive attention. The same happens with context vector C3, where only {O1, O2} have the focus (are responsible) for producing C3.
To compute the value of α: take the exponent of each e and divide by the sum of the exponents of all the e values (a softmax over the scores):
α_ij = exp(e_ij) / Σ_k exp(e_ik)
To compute the value of e:
e_ij = a(s_{i-1}, O_j)
where s_{i-1} is the previous decoder state and O_j is the j-th encoder output. The function a is called the Attention (alignment) function and is responsible for finding the most significant input words for predicting the next target word.
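A minimal sketch of such an alignment function followed by the softmax over its scores; the layer sizes and the names W_s, W_o and v are assumptions made for illustration, in the spirit of the additive attention of the referenced paper:

import torch
import torch.nn as nn

hidden_dim, enc_dim, attn_dim = 64, 128, 32
W_s = nn.Linear(hidden_dim, attn_dim, bias=False)    # projects the decoder state s_{i-1}
W_o = nn.Linear(enc_dim, attn_dim, bias=False)       # projects each encoder output O_j
v   = nn.Linear(attn_dim, 1, bias=False)             # reduces each combined vector to a scalar score

s_prev = torch.randn(1, hidden_dim)                  # s_{i-1}
O = torch.randn(4, enc_dim)                          # encoder outputs O1..O4

e = v(torch.tanh(W_s(s_prev) + W_o(O))).squeeze(-1)  # e_ij = a(s_{i-1}, O_j) for j = 1..4
alpha = torch.softmax(e, dim=-1)                     # alpha_ij = exp(e_ij) / sum_k exp(e_ik)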
Concept to be Cleared : The researchers parameterized the alignment model as a feedforward neural network which is jointly trained with all the other components of the proposed system. This allows the gradient of the cost function to be backpropagated through it, and that gradient trains the network that produces the α values (the Alignment Model) as well as the whole translation model jointly; hence the approach is termed jointly learning to align and translate.
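As a small illustration of the "jointly trained" point, reusing the sketches above: all the parameters, including those of the alignment network, go into one optimizer, so the gradient of the translation loss updates them together.

import torch

# Gather every trainable parameter: encoder, decoder cell and the alignment network (W_s, W_o, v).
params = (list(encoder.parameters())
          + list(decoder_cell.parameters())
          + list(W_s.parameters()) + list(W_o.parameters()) + list(v.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)   # one loss, one optimizer: alignment and translation are trained jointly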
Thank You for reading !
Feel free to add your feedback and connect with me on LinkedIn:
https://www.linkedin.com/in/sonihimanshu1/
Happy Deep Learning !
References :
Neural Machine Translation by Jointly Learning to Align and Translate
https://arxiv.org/abs/1409.0473
Special Thanks to
Sudhanshu Kumar Sir.
Chief AI Engineer and CEO of iNeuron.ai