
CONTEXT

I was wondering why there are sigmoid and tanh activation functions in an LSTM cell.

[Figure: LSTM cell diagram]

My intuition was based on the flow of tanh(x)*sigmoid(x)

[Plot: tanh(x)*sigmoid(x)]

and the derivative of tanh(x)*sigmoid(x)

[Plot: (tanh(x)*sigmoid(x))']

It seems to me that the authors wanted a combination of functions whose derivative allows big changes around 0, since we can use normalized data and weights. Another thing is that the output goes to 1 for positive values and to 0 for negative values, which is convenient.
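To make the shape of this function and its derivative concrete, here is a minimal sketch using only the Python standard library. The helper names `product` and `product_derivative` are made up for illustration; the derivative comes from the product rule with tanh'(x) = 1 - tanh(x)^2 and sigmoid'(x) = sigmoid(x)(1 - sigmoid(x)).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def product(x):
    """tanh(x) * sigmoid(x), with the SAME x fed to both functions."""
    return math.tanh(x) * sigmoid(x)

def product_derivative(x):
    """Derivative of tanh(x) * sigmoid(x) via the product rule."""
    s, t = sigmoid(x), math.tanh(x)
    return (1.0 - t * t) * s + t * s * (1.0 - s)

for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  f(x)={product(x):+.4f}  f'(x)={product_derivative(x):+.4f}")
```

At x = 0 the derivative is exactly 0.5, and the function tends to 1 for large positive x and to 0 for large negative x.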

On the other hand, it seems natural that we use sigmoid in forget gate, since we want to have a better focus on the important data. I just don't understand why there cannot only be a sigmoid function in the input gate.

OTHER SOURCES

What I found on the web is this article where the author claims:

To overcome the vanishing gradient problem, we need a method whose second derivative can sustain for a long range before going to zero. Tanh is a good function that has all the above properties.

However, he doesn't explain why this is the case.

Also, I found the opposite statement here, where the author says that the second derivative of the activation function should go to zero. However, there is no proof for that claim.

QUESTION

Summing up:

  1. Why can't we put a signal with just a sigmoid on the input gate?
  2. Why are there tanh(x)*sigmoid(x) signals in the input and output gates?
nbro
MASTER OF CODE

3 Answers


The tanh functions within the cell represent cell output or cell state. These are the values that either get passed on to other layers, or within the layer to the next time step. In theory, other activation functions could be used here according to taste, similar to other feed-forward or RNN networks. However, the -1 to 1 output range of tanh is useful, and I expect tanh has been experimentally validated as a good general case activation function here.

The sigmoid functions are used as soft gates for manipulating the raw RNN values. Importantly for your analysis, there is no sigmoid that takes the same input as any tanh. Each of the green boxes in the cell diagram in your question has a separate learnable set of weights applied to the combined input+hidden_state vector.

That means that your analysis of tanh(x)*sigmoid(x) is moot. The function is effectively tanh(x)*sigmoid(y) because inputs to each activation function can be radically different.
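A minimal scalar sketch of that point, assuming made-up weights (`w_gate`, `b_gate`, `w_cand`, `b_cand` are hypothetical values, not taken from any trained model). The same input `v` goes through two separate linear maps, so the sigmoid and the tanh see different pre-activations:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical scalar weights; in a real LSTM these are learned matrices
# applied to the combined input + hidden-state vector.
w_gate, b_gate = 2.0, -1.0   # feeds the sigmoid (the gate)
w_cand, b_cand = -1.5, 0.5   # feeds the tanh (the candidate value)

def gated_update(v):
    """Return (gate, candidate, product) for a scalar input v."""
    y = w_gate * v + b_gate   # pre-activation seen by the sigmoid
    x = w_cand * v + b_cand   # pre-activation seen by the tanh
    gate = sigmoid(y)
    cand = math.tanh(x)
    return gate, cand, gate * cand

for v in (-2.0, 0.0, 2.0):
    gate, cand, prod = gated_update(v)
    print(f"v={v:+.1f}  sigmoid(y)={gate:.3f}  tanh(x)={cand:+.3f}  product={prod:+.3f}")
```

With these particular weights, a positive input opens the gate while the candidate is negative, which tanh(x)*sigmoid(x) with a shared x can never produce.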

The intuition is that the LSTM can learn relatively "hard" switches to classify when the sigmoid function should be 0 or 1 (depending on the gate function and input data). As the weights are independent on the gates and input value processing components, the gradients to the cell output and state components are not composed in a combined function, but simply multiplied by the current value of the relevant switch. A multiplying hard switch of 1 will allow the gradient to flow back directly from the output loss to the point at which the gate decision was made. Depending on which gate was activated, this improved gradient signal will go either to the input processing weights or to the hidden state processing weights.
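As a rough numerical illustration of the "multiplied by the switch" idea, the sketch below treats the gate values as fixed numbers (an assumption: in a real LSTM they also depend on the hidden state, which adds further gradient paths). It uses the usual cell update c_t = f * c_{t-1} + i * g and estimates the gradient of the final cell state with respect to the initial one by finite differences. The function names are hypothetical.

```python
def cell_update(c_prev, f, i, g):
    """One cell-state update: forget gate f, input gate i, candidate g."""
    return f * c_prev + i * g

def run(c0, steps):
    """Apply a sequence of (f, i, g) updates starting from c0."""
    c = c0
    for f, i, g in steps:
        c = cell_update(c, f, i, g)
    return c

def grad_wrt_c0(steps, c0=0.5, eps=1e-6):
    """Central finite-difference estimate of d c_T / d c_0."""
    return (run(c0 + eps, steps) - run(c0 - eps, steps)) / (2 * eps)

steps_open = [(1.0, 0.0, 0.3)] * 50    # forget gate held fully open
steps_leaky = [(0.9, 0.0, 0.3)] * 50   # forget gate slightly below 1

print("gradient with f = 1.0:", grad_wrt_c0(steps_open))
print("gradient with f = 0.9:", grad_wrt_c0(steps_leaky))
```

With the forget gate at exactly 1 the gradient passes through 50 steps unchanged, while at 0.9 it shrinks by a factor of 0.9 per step.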

It is also possible for the input and cell state processing to be mixed in various combinations, and the gradient is not guaranteed strong. However, in situations requiring strong memory-like signals (such as using punctuation characters when processing text), it is possible to observe LSTM learning those signals, effectively classifying inputs with high confidence (close to either 0 or 1), thus creating toggle switches, counters etc, within the cell state vector.

Neil Slater

I think a better way to understand LSTMs is by their purpose, instead of gradients and distributions.

If you analyze the interactions of each gate with the cell state, you'll realize that LSTMs essentially implement differentiable memory as a counter.

There are 2 important things to realize:

  • Each sigmoid/tanh is preceded by a linear projection, so the output of the sigmoid of the forget gate is different to that of the input gate.
  • Sigmoid outputs are mostly 0 or 1, while tanh outputs are mostly -1 or 1.

Following the numbering in the diagram:

  1. The forget gate generates a binary mask which is multiplied with the previous cell state. Items with 0 in the mask are 'reset' to 0, while items with 1 in the mask are passed through unchanged.
  2. The input gate generates 2 masks using sigmoid and tanh, whose product is added to the cell state. The tanh mask, with -1 or +1 outputs, determines whether to decrement or increment items in the cell state. The sigmoid mask determines whether an item should be updated at all, similarly to the forget gate.
  3. The output gate determines what to expose to the subsequent layer and state using a similar logic. The tanh here acts as a 'binarizer' to simplify the cell state (which can theoretically be from -infinity to +infinity) to a binary -1 or +1 value, while the sigmoid masks out irrelevant items.
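The three steps above can be sketched as a single time step in plain Python. This is a minimal sketch under stated assumptions: it uses the standard LSTM formulation, random untrained weights, and hypothetical names (`lstm_step`, `random_params`, `affine`) chosen for illustration rather than taken from any library.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W @ v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x, h_prev, c_prev, params):
    v = x + h_prev  # concatenate input and previous hidden state
    # Each gate has its own linear projection of the same vector v.
    f = [sigmoid(z) for z in affine(*params["forget"], v)]       # step 1
    i = [sigmoid(z) for z in affine(*params["input"], v)]        # step 2, sigmoid mask
    g = [math.tanh(z) for z in affine(*params["candidate"], v)]  # step 2, tanh mask
    o = [sigmoid(z) for z in affine(*params["output"], v)]       # step 3
    # Forget mask scales the old state; the gated candidate is added to it.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # tanh squashes the cell state; the output mask selects what to expose.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def random_params(n_in, n_hidden, rng):
    """Untrained weights, for illustration only."""
    def layer():
        W = [[rng.uniform(-1, 1) for _ in range(n_in + n_hidden)]
             for _ in range(n_hidden)]
        return W, [0.0] * n_hidden
    return {name: layer() for name in ("forget", "input", "candidate", "output")}

rng = random.Random(0)
params = random_params(n_in=2, n_hidden=3, rng=rng)
h, c = [0.0] * 3, [0.0] * 3
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    h, c = lstm_step(x, h, c, params)
print("h =", [round(val, 3) for val in h])
print("c =", [round(val, 3) for val in c])
```

Note that only `h` is guaranteed to stay inside (-1, 1); `c` is a running sum and can grow by up to 1 in magnitude per step.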

So while you could replace sigmoid and tanh with other activation functions, they make the most sense in this context and have the added benefit of well-defined derivatives. Remember that LSTMs were introduced in 1997, when tanh and sigmoid activation functions were still popular and ReLU had not taken over yet (~2011).

Cyruno

The purpose of the tanh and sigmoid functions in an LSTM (Long Short-Term Memory) network is to control the flow of information through the cell state, which is the "memory" of the network.

The sigmoid function, also known as the "gate," is used to determine which information is allowed into the cell state. Specifically, it controls the flow of information through the "input gate," which determines which new information is added to the cell state.

The tanh function is used to regulate the values written to and read from the cell state. This is important because values that become very large over time can make it difficult for the network to learn. The tanh on the candidate values keeps each update within -1 to 1, and the tanh applied to the cell state before the output gate keeps the exposed hidden state in the same range. The cell state itself is a running sum and is not strictly bounded.

The sigmoid and tanh functions work together to regulate the flow of information through the LSTM network, allowing it to selectively update its memory while keeping updates and outputs within a reasonable range. This makes the LSTM network especially effective for tasks that require long-term memory and the ability to selectively forget or remember information over time.

Suman