Sequence Modeling
Recurrent Neural Network (RNN)
Sequence Loss
In the previous chapter, we observed that the Cross Entropy (CE) loss is given by the following equation:
where \(\hat{\bsf{y}}\) is the class probability predicted by the model.
Writing \(\hat{\bsf{y}}\) explicitly as a predicted probability, we obtain the following equation:
In sequence modeling, we use the same equation as in (11). Note that in equation (12) above, \(\bsf{y}\) is not the ground-truth label but just a dummy vector variable.
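Concretely, if \(\hat{P}(\bsf{y} \mid \bsf{x})\) denotes the probability that the model assigns to a label \(\bsf{y}\) given the input \(\bsf{x}\) (a notation assumed here for illustration), the loss above can be sketched as

\[
\mathcal{L}_{CE} = -\log \hat{P}(\bsf{y} \mid \bsf{x}),
\]

evaluated at the ground-truth label.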
Now, the input and the predicted output may be sequences as follows:
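For instance, with sequence length \(L\), one possible notation is

\[
\bsf{x} = (\bsf{x}_1, \bsf{x}_2, \ldots, \bsf{x}_L), \qquad \hat{\bsf{y}} = (\hat{\bsf{y}}_1, \hat{\bsf{y}}_2, \ldots, \hat{\bsf{y}}_L).
\]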
Using this notation, equation (12) may be expressed as:
It is usually intractable to calculate the sequence probability in (13) directly. Thus, we take the conditional independence assumption:
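Under this assumption, each output is taken to depend only on the inputs observed up to the same time step, so the sequence probability factorizes roughly as (a sketch; the exact conditioning used in (14) may differ)

\[
P(\bsf{y}_1, \ldots, \bsf{y}_L \mid \bsf{x}_1, \ldots, \bsf{x}_L) \approx \prod_{t=1}^{L} P(\bsf{y}_t \mid \bsf{x}_1, \ldots, \bsf{x}_t).
\]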
By substituting (14) into (11), we obtain the following sequence loss:
Even though we use a fixed value of \(L\) in (15), the sequence length may vary across examples in the training data set. When there are \(V\) classes, (15) can be written as:
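As a minimal code sketch of this per-step cross entropy (the function name, tensor shapes, and the small epsilon below are illustrative assumptions, not code from this chapter):

```python
import tensorflow as tf

def sequence_loss(probs, targets, mask):
    """Per-step cross entropy summed over classes, averaged over valid steps.

    probs:   [batch, L, V] predicted probabilities (softmax outputs)
    targets: [batch, L, V] one-hot ground-truth labels
    mask:    [batch, L]    1.0 for real time steps, 0.0 for padding
    """
    # -sum_v y_{t,v} * log(y_hat_{t,v}) at every time step
    step_ce = -tf.reduce_sum(targets * tf.math.log(probs + 1e-9), axis=-1)
    # Zero out padded steps so variable-length sequences are handled correctly.
    step_ce = step_ce * mask
    # Average over the valid steps of each example, then over the batch.
    per_example = tf.reduce_sum(step_ce, axis=1) / tf.reduce_sum(mask, axis=1)
    return tf.reduce_mean(per_example)
```

The mask is what accommodates the varying sequence lengths mentioned above: padded steps contribute nothing to the loss.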
In TensorFlow, this loss is implemented by the tfa.seq2seq.sequence_loss method from TensorFlow Addons.
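A minimal usage sketch, assuming the tensorflow-addons package that provides tfa is installed; the sizes below are illustrative and the tensors follow the method's [batch, time, classes] convention:

```python
import tensorflow as tf
import tensorflow_addons as tfa

batch, L, V = 2, 5, 10                              # illustrative sizes
logits  = tf.random.normal([batch, L, V])           # unnormalized per-step scores
targets = tf.random.uniform([batch, L], maxval=V, dtype=tf.int32)
weights = tf.ones([batch, L])                       # per-step weights, e.g. a padding mask

# By default the per-step cross entropy is averaged across time steps and
# across the batch (see average_across_timesteps / average_across_batch).
loss = tfa.seq2seq.sequence_loss(logits, targets, weights)
print(loss.numpy())
```

Note that, unlike the sketch above, this method takes unnormalized logits and integer class labels rather than probabilities and one-hot vectors; the softmax and cross entropy are computed internally.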
Back-Propagation Through Time (BPTT)
For deep neural network models, we may not be able to obtain the gradient directly. Thus, we use the chain rule to obtain the gradient with respect to a certain