Gated Recurrent Unit (GRU)
Reset Gate and Update Gate

Mathematically, for a given time step $t$, suppose that the input is a minibatch $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ (number of examples $n$; number of inputs $d$) and the hidden state of the previous time step is $\mathbf{H}_{t-1} \in \mathbb{R}^{n \times h}$ (number of hidden units $h$). Then the reset gate $\mathbf{R}_t \in \mathbb{R}^{n \times h}$ and update gate $\mathbf{Z}_t \in \mathbb{R}^{n \times h}$ are computed as follows:

$$\mathbf{R}_t = \sigma(\mathbf{X}_t \mathbf{W}_{xr} + \mathbf{H}_{t-1} \mathbf{W}_{hr} + \mathbf{b}_r),$$
$$\mathbf{Z}_t = \sigma(\mathbf{X}_t \mathbf{W}_{xz} + \mathbf{H}_{t-1} \mathbf{W}_{hz} + \mathbf{b}_z),$$

where $\mathbf{W}_{xr}, \mathbf{W}_{xz} \in \mathbb{R}^{d \times h}$ and $\mathbf{W}_{hr}, \mathbf{W}_{hz} \in \mathbb{R}^{h \times h}$ are weight parameters and $\mathbf{b}_r, \mathbf{b}_z \in \mathbb{R}^{1 \times h}$ are bias parameters.
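As a minimal sketch (assuming PyTorch and randomly initialized example tensors; the parameter names are illustrative, not code from the original post), the two gates are just affine maps followed by a sigmoid:

```python
import torch

n, d, h = 2, 8, 16  # batch size, number of inputs, number of hidden units (example values)
X_t = torch.randn(n, d)      # minibatch input at time step t
H_prev = torch.zeros(n, h)   # hidden state from the previous time step

# Gate parameters, shapes as in the equations above (hypothetical initialization)
W_xr, W_hr, b_r = torch.randn(d, h), torch.randn(h, h), torch.zeros(h)
W_xz, W_hz, b_z = torch.randn(d, h), torch.randn(h, h), torch.zeros(h)

R_t = torch.sigmoid(X_t @ W_xr + H_prev @ W_hr + b_r)  # reset gate, entries in (0, 1)
Z_t = torch.sigmoid(X_t @ W_xz + H_prev @ W_hz + b_z)  # update gate, entries in (0, 1)
```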
Candidate Hidden State

Next, we integrate the reset gate with the regular updating mechanism, leading to the following candidate hidden state $\tilde{\mathbf{H}}_t \in \mathbb{R}^{n \times h}$ at time step $t$:

$$\tilde{\mathbf{H}}_t = \tanh(\mathbf{X}_t \mathbf{W}_{xh} + (\mathbf{R}_t \odot \mathbf{H}_{t-1}) \mathbf{W}_{hh} + \mathbf{b}_h),$$

where $\mathbf{W}_{xh} \in \mathbb{R}^{d \times h}$ and $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$ are weight parameters, $\mathbf{b}_h \in \mathbb{R}^{1 \times h}$ is the bias, and the symbol $\odot$ is the Hadamard (elementwise) product operator.
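A small self-contained sketch of this step (assuming PyTorch and made-up tensor sizes; the reset gate here is just a random placeholder standing in for the value computed above):

```python
import torch

n, d, h = 2, 8, 16
X_t, H_prev = torch.randn(n, d), torch.randn(n, h)
R_t = torch.rand(n, h)  # reset gate from the previous step, entries in (0, 1)

W_xh, W_hh, b_h = torch.randn(d, h), torch.randn(h, h), torch.zeros(h)

# The reset gate scales H_{t-1} elementwise before it enters the candidate state
H_tilde = torch.tanh(X_t @ W_xh + (R_t * H_prev) @ W_hh + b_h)
```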

Hidden State
Finally, we need to incorporate the effect of the update gate $\mathbf{Z}_t$. This determines the extent to which the new hidden state $\mathbf{H}_t \in \mathbb{R}^{n \times h}$ matches the old state $\mathbf{H}_{t-1}$ versus how much it resembles the new candidate state $\tilde{\mathbf{H}}_t$. The update gate achieves this simply by taking elementwise convex combinations of $\mathbf{H}_{t-1}$ and $\tilde{\mathbf{H}}_t$. This leads to the final update equation for the GRU:

$$\mathbf{H}_t = \mathbf{Z}_t \odot \mathbf{H}_{t-1} + (1 - \mathbf{Z}_t) \odot \tilde{\mathbf{H}}_t.$$
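The convex combination is a one-liner; a sketch with placeholder tensors (assuming PyTorch, example sizes only):

```python
import torch

n, h = 2, 16
Z_t = torch.rand(n, h)       # update gate, entries in (0, 1)
H_prev = torch.randn(n, h)   # old hidden state H_{t-1}
H_tilde = torch.randn(n, h)  # candidate hidden state from the previous equation

# Z_t close to 1 keeps the old state; Z_t close to 0 moves toward the candidate state
H_t = Z_t * H_prev + (1 - Z_t) * H_tilde
```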

Implementation of GRU
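A from-scratch sketch that stacks the three equations above into a loop over time steps might look like the following (assuming PyTorch; names such as `init_gru_params` and `gru` and all sizes are illustrative, not taken from the original post):

```python
import torch

def init_gru_params(num_inputs, num_hiddens):
    """Initialize GRU weights and biases; one (W_x, W_h, b) triple per equation."""
    def three():
        return (torch.randn(num_inputs, num_hiddens) * 0.01,
                torch.randn(num_hiddens, num_hiddens) * 0.01,
                torch.zeros(num_hiddens))
    return {'update': three(), 'reset': three(), 'candidate': three()}

def gru(inputs, H, params):
    """inputs: (num_steps, batch_size, num_inputs); H: (batch_size, num_hiddens)."""
    W_xz, W_hz, b_z = params['update']
    W_xr, W_hr, b_r = params['reset']
    W_xh, W_hh, b_h = params['candidate']
    outputs = []
    for X in inputs:  # iterate over time steps
        Z = torch.sigmoid(X @ W_xz + H @ W_hz + b_z)            # update gate
        R = torch.sigmoid(X @ W_xr + H @ W_hr + b_r)            # reset gate
        H_tilde = torch.tanh(X @ W_xh + (R * H) @ W_hh + b_h)   # candidate state
        H = Z * H + (1 - Z) * H_tilde                            # convex combination
        outputs.append(H)
    return torch.stack(outputs), H

# Tiny smoke test with made-up sizes
params = init_gru_params(num_inputs=8, num_hiddens=16)
X = torch.randn(5, 2, 8)                     # 5 time steps, batch of 2
Y, H = gru(X, torch.zeros(2, 16), params)
print(Y.shape, H.shape)                      # torch.Size([5, 2, 16]) torch.Size([2, 16])
```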
Long Short-Term Memory (LSTM)
Mathematically, suppose that there are $h$ hidden units, the batch size is $n$, and the number of inputs is $d$. Thus, the input is $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ and the hidden state of the previous time step is $\mathbf{H}_{t-1} \in \mathbb{R}^{n \times h}$. Correspondingly, the gates at time step $t$ are defined as follows: the input gate is $\mathbf{I}_t \in \mathbb{R}^{n \times h}$, the forget gate is $\mathbf{F}_t \in \mathbb{R}^{n \times h}$, and the output gate is $\mathbf{O}_t \in \mathbb{R}^{n \times h}$. They are calculated as follows:

$$\mathbf{I}_t = \sigma(\mathbf{X}_t \mathbf{W}_{xi} + \mathbf{H}_{t-1} \mathbf{W}_{hi} + \mathbf{b}_i),$$
$$\mathbf{F}_t = \sigma(\mathbf{X}_t \mathbf{W}_{xf} + \mathbf{H}_{t-1} \mathbf{W}_{hf} + \mathbf{b}_f),$$
$$\mathbf{O}_t = \sigma(\mathbf{X}_t \mathbf{W}_{xo} + \mathbf{H}_{t-1} \mathbf{W}_{ho} + \mathbf{b}_o),$$

where $\mathbf{W}_{xi}, \mathbf{W}_{xf}, \mathbf{W}_{xo} \in \mathbb{R}^{d \times h}$ and $\mathbf{W}_{hi}, \mathbf{W}_{hf}, \mathbf{W}_{ho} \in \mathbb{R}^{h \times h}$ are weight parameters and $\mathbf{b}_i, \mathbf{b}_f, \mathbf{b}_o \in \mathbb{R}^{1 \times h}$ are bias parameters.
We use sigmoid functions to map the input values to the interval $(0, 1)$.
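The three gates share the same affine-plus-sigmoid form; a minimal sketch (assuming PyTorch, example sizes, and hypothetical parameter names):

```python
import torch

n, d, h = 2, 8, 16  # batch size, number of inputs, number of hidden units
X_t, H_prev = torch.randn(n, d), torch.zeros(n, h)

# One (W_x, W_h, b) triple per gate, shapes as in the equations above
W_xi, W_hi, b_i = torch.randn(d, h), torch.randn(h, h), torch.zeros(h)
W_xf, W_hf, b_f = torch.randn(d, h), torch.randn(h, h), torch.zeros(h)
W_xo, W_ho, b_o = torch.randn(d, h), torch.randn(h, h), torch.zeros(h)

I_t = torch.sigmoid(X_t @ W_xi + H_prev @ W_hi + b_i)  # input gate
F_t = torch.sigmoid(X_t @ W_xf + H_prev @ W_hf + b_f)  # forget gate
O_t = torch.sigmoid(X_t @ W_xo + H_prev @ W_ho + b_o)  # output gate
```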

Input Node
Next we design the memory cell. Since we have not specified the action of the various gates yet, we first introduce the input node $\tilde{\mathbf{C}}_t \in \mathbb{R}^{n \times h}$. Its computation is similar to that of the three gates described above, but it uses a $\tanh$ function with a value range of $(-1, 1)$ as the activation function. This leads to the following equation at time step $t$:

$$\tilde{\mathbf{C}}_t = \tanh(\mathbf{X}_t \mathbf{W}_{xc} + \mathbf{H}_{t-1} \mathbf{W}_{hc} + \mathbf{b}_c).$$
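A sketch of the input node alone (assuming PyTorch, example sizes, hypothetical parameter names):

```python
import torch

n, d, h = 2, 8, 16
X_t, H_prev = torch.randn(n, d), torch.zeros(n, h)
W_xc, W_hc, b_c = torch.randn(d, h), torch.randn(h, h), torch.zeros(h)

# Same affine form as the gates, but squashed into (-1, 1) by tanh
C_tilde = torch.tanh(X_t @ W_xc + H_prev @ W_hc + b_c)
```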

Memory Cell Internal State
In LSTMs, the input gate $\mathbf{I}_t$ governs how much we take new data into account via $\tilde{\mathbf{C}}_t$, and the forget gate $\mathbf{F}_t$ addresses how much of the old cell internal state $\mathbf{C}_{t-1} \in \mathbb{R}^{n \times h}$ we retain. Using the Hadamard (elementwise) product operator $\odot$ we arrive at the following update equation:

$$\mathbf{C}_t = \mathbf{F}_t \odot \mathbf{C}_{t-1} + \mathbf{I}_t \odot \tilde{\mathbf{C}}_t.$$
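As with the GRU's convex combination, the cell update is a single elementwise expression; a sketch with placeholder tensors (assuming PyTorch, example sizes only):

```python
import torch

n, h = 2, 16
F_t = torch.rand(n, h)       # forget gate: how much of the old cell state to keep
I_t = torch.rand(n, h)       # input gate: how much of the input node to admit
C_prev = torch.randn(n, h)   # old cell internal state C_{t-1}
C_tilde = torch.randn(n, h)  # input node from the previous equation

C_t = F_t * C_prev + I_t * C_tilde  # elementwise (Hadamard) products
```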

Implementation of LSTM
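A from-scratch sketch combining the gates, the input node, and the cell update into a loop over time steps (assuming PyTorch; names such as `init_lstm_params` and `lstm` and all sizes are illustrative, not taken from the original post; the hidden state is read out through the output gate in the standard way, $\mathbf{H}_t = \mathbf{O}_t \odot \tanh(\mathbf{C}_t)$):

```python
import torch

def init_lstm_params(num_inputs, num_hiddens):
    """Initialize one (W_x, W_h, b) triple per gate and one for the input node."""
    def three():
        return (torch.randn(num_inputs, num_hiddens) * 0.01,
                torch.randn(num_hiddens, num_hiddens) * 0.01,
                torch.zeros(num_hiddens))
    return {'input': three(), 'forget': three(), 'output': three(), 'node': three()}

def lstm(inputs, state, params):
    """inputs: (num_steps, batch_size, num_inputs); state: (H, C)."""
    W_xi, W_hi, b_i = params['input']
    W_xf, W_hf, b_f = params['forget']
    W_xo, W_ho, b_o = params['output']
    W_xc, W_hc, b_c = params['node']
    H, C = state
    outputs = []
    for X in inputs:  # iterate over time steps
        I = torch.sigmoid(X @ W_xi + H @ W_hi + b_i)       # input gate
        F = torch.sigmoid(X @ W_xf + H @ W_hf + b_f)       # forget gate
        O = torch.sigmoid(X @ W_xo + H @ W_ho + b_o)       # output gate
        C_tilde = torch.tanh(X @ W_xc + H @ W_hc + b_c)    # input node
        C = F * C + I * C_tilde                             # memory cell internal state
        H = O * torch.tanh(C)                               # hidden state via the output gate
        outputs.append(H)
    return torch.stack(outputs), (H, C)

# Tiny smoke test with made-up sizes
params = init_lstm_params(num_inputs=8, num_hiddens=16)
X = torch.randn(5, 2, 8)
Y, (H, C) = lstm(X, (torch.zeros(2, 16), torch.zeros(2, 16)), params)
print(Y.shape, H.shape, C.shape)
```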
Concise Implementation of GRU
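Using the framework's built-in layer, the whole recurrence collapses to a single call; a minimal sketch assuming PyTorch's `nn.GRU` with example sizes (the default layout is `(num_steps, batch_size, num_inputs)`):

```python
import torch
from torch import nn

num_inputs, num_hiddens = 8, 16            # example sizes
gru_layer = nn.GRU(num_inputs, num_hiddens)

X = torch.randn(5, 2, num_inputs)          # (num_steps, batch_size, num_inputs)
state = torch.zeros(1, 2, num_hiddens)     # (num_layers, batch_size, num_hiddens)
Y, state = gru_layer(X, state)
print(Y.shape, state.shape)                # torch.Size([5, 2, 16]) torch.Size([1, 2, 16])
```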
Concise Implementation of LSTM
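The concise LSTM looks the same, except that its state is a pair of hidden state and cell state; a minimal sketch assuming PyTorch's `nn.LSTM` with example sizes:

```python
import torch
from torch import nn

num_inputs, num_hiddens = 8, 16
lstm_layer = nn.LSTM(num_inputs, num_hiddens)

X = torch.randn(5, 2, num_inputs)            # (num_steps, batch_size, num_inputs)
H0 = torch.zeros(1, 2, num_hiddens)          # initial hidden state
C0 = torch.zeros(1, 2, num_hiddens)          # initial cell state
Y, (H, C) = lstm_layer(X, (H0, C0))
print(Y.shape, H.shape, C.shape)
```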
LSTMs are the prototypical latent variable autoregressive model with nontrivial state control. Many variants thereof have been proposed over the years, e.g., multiple layers, residual connections, and different types of regularization. However, training LSTMs and other sequence models (such as GRUs) is quite costly because of the long-range dependencies of the sequence. Later we will encounter alternative models such as Transformers that can be used in some cases.