Training an LSTM is not easy for beginners in this field. There are a lot of tricks to choosing the most appropriate hyperparameters and structures, which can only be learned through experience. In this article, I'd like to share some tricks that I have summarized from my own experience and from docs on the Internet.
Dealing with Variable-Length Sequences
Variable-length sequences can sometimes be very annoying, especially when we want to use minibatches to accelerate training. Fortunately, PyTorch has prepared a bunch of tools to facilitate this. You can find two useful functions in torch.nn.utils.rnn: pack_sequence() and pad_packed_sequence(). Here are the prototypes of these two functions:
pack_sequence(sequences, enforce_sorted=True)
pack_sequence() takes a list containing the sequences you want to feed to the LSTM, and it returns a PackedSequence (refer to the docs of torch.nn.utils.rnn for details).
pad_packed_sequence(sequence, batch_first=False, padding_value=0.0, total_length=None)
pad_packed_sequence() takes a PackedSequence as input, and returns a padded tensor together with a tensor of the original sequence lengths.
Here is an example.
>>> import torch
>>> from torch.nn.utils.rnn import pack_sequence
>>> from torch.nn.utils.rnn import pad_packed_sequence
>>> a = torch.tensor([1, 2, 3])
>>> b = torch.tensor([1, 2, 3, 4])
>>> c = torch.tensor([1, 2, 3, 4, 5])
>>> packed_sequence = pack_sequence([a, b, c], enforce_sorted=False)
>>> print(packed_sequence)
PackedSequence(data=tensor([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5]),
               batch_sizes=tensor([3, 3, 3, 2, 1]),
               sorted_indices=tensor([2, 1, 0]),
               unsorted_indices=tensor([2, 1, 0]))
>>> sequence = pad_packed_sequence(packed_sequence)
>>> print(sequence)
(tensor([[1, 1, 1],
         [2, 2, 2],
         [3, 3, 3],
         [0, 4, 4],
         [0, 0, 5]]), tensor([3, 4, 5]))
They also work well for higher-dimensional inputs, such as sentence representations after embedding.
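For example, here is a quick check with 2-D sequences (time steps × embedding dimension; the embedding size of 4 below is an arbitrary choice for illustration):

>>> a = torch.randn(3, 4)  # 3 tokens, each a 4-dim embedding
>>> b = torch.randn(5, 4)  # 5 tokens
>>> packed = pack_sequence([a, b], enforce_sorted=False)
>>> padded, lengths = pad_packed_sequence(packed)
>>> padded.shape  # (max_seq_len, batch_size, embed_dim)
torch.Size([5, 2, 4])
>>> lengths
tensor([3, 5])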
These functions are really convenient. You just need to place pack_sequence() before your LSTM and pad_packed_sequence() after your LSTM. Here is part of the implementation of my BiLSTM:
embed = [self.embed(i) for i in x]
packed_embed = pack_sequence(embed, enforce_sorted=False)
bilstm_out, _ = self.bilstm(packed_embed)
bilstm_out, _ = pad_packed_sequence(bilstm_out)
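For context, here is a minimal, self-contained sketch of how such a module might be assembled (the class name, vocabulary size, embedding dimension, and hidden size are placeholders of my own, not from the original code):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_sequence, pad_packed_sequence

class BiLSTM(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_size, bidirectional=True)

    def forward(self, x):
        # x is a list of 1-D LongTensors of token ids, one per sentence
        embed = [self.embed(i) for i in x]
        packed_embed = pack_sequence(embed, enforce_sorted=False)
        bilstm_out, _ = self.bilstm(packed_embed)
        # the padded output has shape (max_seq_len, batch_size, 2 * hidden_size)
        bilstm_out, lengths = pad_packed_sequence(bilstm_out)
        return bilstm_out, lengths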
Special Notes for Dropout
Dropout cannot take a PackedSequence as its input, so you should apply dropout to your inputs before they are packed into a PackedSequence. I achieve this as follows:
# In __init__(self) of your self-defined LSTM class
self.dropout = nn.Dropout(p=0.5)

# In forward(self, x) of your self-defined LSTM class
embed = [self.dropout(self.embed(i)) for i in x]
packed_embed = pack_sequence(embed, enforce_sorted=False)
bilstm_out, _ = self.bilstm(packed_embed)
bilstm_out, _ = pad_packed_sequence(bilstm_out)
However, I suspect this operation slows down training, since looping over a Python list runs on the CPU.
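If that becomes a bottleneck, one possible workaround (my own sketch, not from the PyTorch docs) is to pad the raw token ids first with pad_sequence(), run the embedding and dropout on the whole padded tensor in a single call, and only then pack with pack_padded_sequence():

from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# x is still a list of 1-D token-id tensors, one per sentence
lengths = [len(i) for i in x]
padded = pad_sequence(x)                  # (max_seq_len, batch_size)
embed = self.dropout(self.embed(padded))  # one embedding + one dropout call
packed_embed = pack_padded_sequence(embed, lengths, enforce_sorted=False)
bilstm_out, _ = self.bilstm(packed_embed)
bilstm_out, _ = pad_packed_sequence(bilstm_out)

Dropout is also applied to the padded positions here, but those entries are discarded by the packing, so the result is unaffected.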
Get the Output of the Last Time Steps
Looking at the padded output in the minibatch example above, we can see that it is not easy to get the final output of the LSTM. If you simply index with -1, you will get a lot of zeros, which were added by the padding. Here is a solution I found on Stack Overflow:
indexes = [len(i) - 1 for i in x]  # x is the input: a list containing the sentence embeddings in a batch
bilstm_out = bilstm_out[indexes, range(bilstm_out.shape[1]), :]
There is another way to achieve this:
indexes = torch.tensor([len(i) - 1 for i in x]).to(torch.long)
# reshape to (1, batch_size, 1) and expand along the hidden dimension so
# that gather() can pick one time step per batch element
indexes = indexes.view(1, -1, 1).expand(1, -1, bilstm_out.shape[2])
bilstm_out = bilstm_out.gather(0, indexes).squeeze(0)
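A quick way to convince yourself that the two approaches agree (the shapes and lengths below are made up just for the check):

import torch

seq_len, batch_size, hidden = 5, 3, 4
bilstm_out = torch.randn(seq_len, batch_size, hidden)
lengths = [3, 5, 4]  # lengths of the three sequences in the batch

# approach 1: fancy indexing
indexes = [l - 1 for l in lengths]
last1 = bilstm_out[indexes, torch.arange(batch_size), :]

# approach 2: gather along the time dimension
idx = torch.tensor(indexes).view(1, -1, 1).expand(1, batch_size, hidden)
last2 = bilstm_out.gather(0, idx).squeeze(0)

assert torch.equal(last1, last2)  # both are (batch_size, hidden)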
Initialization

A lot of the literature points out that initialization has a huge impact on the performance of an LSTM. In the PyTorch implementation, a 1-layer LSTM has four sets of parameters: weight_ih_l0, weight_hh_l0, bias_ih_l0 and bias_hh_l0. PyTorch initializes them from a uniform distribution by default, but that is usually not the best initialization. PyTorch also implements a set of initialization methods, which can be found in torch.nn.init. For LSTM, it is recommended to use
nn.init.orthogonal_() to initialize the weights, nn.init.zeros_() to initialize all the biases except those of the forget gates, and nn.init.ones_() to initialize the biases of the forget gates. Initializing the forget-gate biases to 1 helps the LSTM better learn long-term dependencies.
According to an issue on GitHub, the bias_ih and bias_hh vectors play identical roles (the cell simply adds them together), hence they should be treated in the same way.
The complete code is as follows:
lstm = nn.LSTM(input_size, hidden_size)
nn.init.orthogonal_(lstm.weight_ih_l0)
nn.init.orthogonal_(lstm.weight_hh_l0)
# gates are ordered (input, forget, cell, output), so the forget-gate
# bias lives in the slice [hidden_size:hidden_size*2]
nn.init.zeros_(lstm.bias_ih_l0)
nn.init.ones_(lstm.bias_ih_l0[hidden_size:hidden_size*2])
nn.init.zeros_(lstm.bias_hh_l0)
nn.init.ones_(lstm.bias_hh_l0[hidden_size:hidden_size*2])
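Since the two bias vectors are simply added inside the LSTM cell, the effective forget-gate bias after this initialization is 1 + 1 = 2. A quick sanity check, reusing the lstm and hidden_size from above:

# both bias vectors contribute to the forget gate
forget_bias = (lstm.bias_ih_l0[hidden_size:hidden_size*2]
               + lstm.bias_hh_l0[hidden_size:hidden_size*2])
print(forget_bias)  # a vector of 2.0s

If you prefer an effective forget-gate bias of exactly 1, apply the ones_ slice to only one of the two bias vectors.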