Sequence Modeling#

Recurrent Neural Network (RNN)#

Sequence Loss#

In the previous chapter, we observed that the Cross Entropy (CE) loss is given by the following equation:

(11)#\[\begin{split} \mathbb{L}_{\text{CE}} & = -E_{\bsf{y} \sim \text{training_data}} \left[ \log \hat{\bsf{y}} \right]. \\\end{split}\]

where \(\hat{\bsf{y}}\) is the class probability predicted by the model.

If we write \(\hat{\bsf{y}}\) explicitly as the predicted conditional probability of the label given the input, we obtain the following equation:

(12)#\[ \hat{\bsf{y}} = \hat{p}(\bsf{y} | \bsf{x}).\]

In sequence modeling, we use the same loss as in (11). Note that in equation (12), \(\bsf{y}\) is not the ground-truth label but just a dummy vector variable.
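
As a concrete numerical illustration of (11) and (12), the short sketch below (using NumPy, with hypothetical probability values) computes the cross-entropy of a single prediction: the loss is the negative log of the probability that the model assigns to the ground-truth class.

```python
import numpy as np

# Hypothetical predicted class probabilities for one input x (V = 4 classes).
y_hat = np.array([0.1, 0.7, 0.15, 0.05])   # \hat{p}(y | x)

# Suppose the ground-truth class is index 1.
true_class = 1

# Cross-entropy for this single example: the negative log of the
# probability assigned to the ground-truth class.
ce_loss = -np.log(y_hat[true_class])
print(ce_loss)   # ~0.357
```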

Now, the input and the predicted output may be sequences as follows:

\[\begin{split} \bsf{x}_{0:M} & = \left[ \bsf{x}_0,\, \bsf{x}_1,\, \bsf{x}_2,\, \cdots, \, \bsf{x}_{M-1} \right] \\ \bsf{y}_{0:L} & = \left[ \bsf{y}_0,\, \bsf{y}_1,\, \bsf{y}_2,\, \cdots, \, \bsf{y}_{L-1} \right]\end{split}\]

Using this notation, equation (12) may be expressed as:

(13)#\[ \hat{\bsf{y}}_{0:L} = \hat{p}(\bsf{y}_{0:L} | \bsf{x}_{0:M}).\]

It is usually intractable to calculate the sequence probability in (13) directly. Thus, we typically adopt the conditional independence assumption:

(14)#\[\begin{split} \hat{\bsf{y}}_{0:L} & = \prod_{l=0}^{L-1} \hat{p}(\bsf{y}_l | \bsf{x}_{0:M}) \\ & = \prod_{l=0}^{L-1} \hat{\bsf{y}}_l\end{split}\]

By substituting (14) into (11), we obtain the following sequence loss:

(15)#\[\begin{split} \mathbb{L}_{\text{CE}} & = -E_{\bsf{y}_{0:L} \sim \text{training_data}} \left[ \sum_{l=0}^{L-1} \log \hat{\bsf{y}}_l \right]. \\\end{split}\]
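
The independence assumption is what turns the log of a sequence probability into a per-step sum. The quick numerical check below (with hypothetical per-step probabilities) confirms that the negative log of the product in (14) equals the sum of negative log terms appearing in (15).

```python
import numpy as np

# Hypothetical per-step probabilities \hat{y}_l assigned to the
# ground-truth symbols of one target sequence (L = 3).
y_hat_steps = np.array([0.8, 0.6, 0.9])

# Negative log of the joint probability under the independence assumption (14).
loss_from_product = -np.log(np.prod(y_hat_steps))

# Per-step sum of negative log probabilities, as in (15).
loss_from_sum = -np.sum(np.log(y_hat_steps))

print(np.isclose(loss_from_product, loss_from_sum))  # True
```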

Even though we write a fixed length \(L\) in (15), the length may vary from example to example in the training data set. When there are \(V\) classes, (15) can be written elementwise as:

(16)#\[\begin{split} \mathbb{L}_{\text{CE}} & = -E_{\bsf{y}_{0:L} \sim \text{training_data}} \left[ \sum_{l=0}^{L-1} \sum_{v=0}^{V-1} (\bsf{y}_l)_v \log (\hat{\bsf{y}}_l)_v \right]. \\\end{split}\]
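
A direct, unoptimized translation of (16) into TensorFlow might look like the sketch below (the function name and shapes are hypothetical, not a library API): the one-hot targets select the log probability of the ground-truth class at each step, the result is summed over classes and time steps, and the expectation is approximated by a batch average.

```python
import tensorflow as tf

def sequence_ce_loss(y_true_onehot, y_hat_probs):
    """Sequence cross-entropy as in (16).

    y_true_onehot: [batch, L, V] one-hot ground-truth symbols (y_l)_v.
    y_hat_probs:   [batch, L, V] predicted probabilities (y_hat_l)_v.
    """
    # Sum over classes v and time steps l for each example ...
    per_example = -tf.reduce_sum(
        y_true_onehot * tf.math.log(y_hat_probs + 1e-9), axis=[1, 2])
    # ... then approximate the expectation by averaging over the batch.
    return tf.reduce_mean(per_example)

# Hypothetical toy batch: 2 sequences of length L = 3 over V = 4 classes.
targets = tf.one_hot([[1, 2, 0], [3, 3, 1]], depth=4)
probs = tf.fill([2, 3, 4], 0.25)              # a uniform predictor
print(sequence_ce_loss(targets, probs))       # 3 * log(4) ~ 4.159
```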

In TensorFlow, this loss is implemented as the tfa.seq2seq.sequence_loss function in the TensorFlow Addons package.
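
A minimal usage sketch is shown below, assuming the TensorFlow Addons package is installed (the shapes and values are hypothetical). Note that tfa.seq2seq.sequence_loss expects unnormalized logits rather than probabilities, and its weights argument doubles as a mask, which is one common way to handle the varying sequence lengths mentioned above.

```python
import tensorflow as tf
import tensorflow_addons as tfa

batch, L, V = 2, 3, 4

# Hypothetical model outputs: unnormalized logits, shape [batch, L, V].
logits = tf.random.normal([batch, L, V])

# Ground-truth class indices y_l, shape [batch, L].
targets = tf.constant([[1, 2, 0],
                       [3, 1, 1]])

# Per-step weights, shape [batch, L]; setting trailing weights to 0
# masks out padded positions when sequence lengths differ.
weights = tf.constant([[1., 1., 1.],
                       [1., 1., 0.]])

loss = tfa.seq2seq.sequence_loss(logits, targets, weights)
print(loss)  # scalar loss averaged over time steps and batch
```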

Back-Propagation Through Time (BPTT)#

For deep neural network models, we may not be able to obtain the gradient directly. Thus, we use the chain rule to obtain the gradient with respect to a certain