
When and why would you not use Tanh?

I just replaced ReLU with Tanh and my model trains about 2x faster, reaching 90% accuracy within 500 steps.

With ReLU, it took more than 1000 training steps to reach 90% accuracy.

I believe it trained faster because Tanh has a steeper gradient (correct me if I'm wrong).

This led me to wonder: why wouldn't we use Tanh over ReLU in most contexts? Is it the computational cost, or something else?
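For reference, here's a minimal sketch of the kind of comparison I mean (PyTorch here, with a placeholder architecture and sizes, not my actual setup): the only thing that changes between the two runs is the activation.

```python
# Sketch only: the architecture, sizes, and training details are placeholders.
import torch.nn as nn

def make_mlp(activation: nn.Module, in_dim=784, hidden=256, out_dim=10):
    # Same layers either way; only the activation module differs.
    return nn.Sequential(
        nn.Linear(in_dim, hidden),
        activation,
        nn.Linear(hidden, hidden),
        activation,
        nn.Linear(hidden, out_dim),
    )

model_relu = make_mlp(nn.ReLU())
model_tanh = make_mlp(nn.Tanh())
# Train both with identical data, optimizer, and seed, then compare the step
# at which each first reaches 90% validation accuracy.
```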

vxnuaj

1 Answer


Using tanh in hidden layers requires careful initialization of the network weights, and it works best when the input features are normalized to the same range as its output (i.e. -1 to 1). It is expensive to evaluate (it involves exponentials) and it suffers from vanishing gradients. On the other hand, it can converge faster because of its zero-centered output.
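To make the vanishing-gradient point concrete, here is a small NumPy sketch (the sample points are arbitrary): tanh's derivative, 1 - tanh(x)^2, collapses toward zero once |x| grows past about 3, while ReLU's derivative stays at 1 for any positive input.

```python
import numpy as np

x = np.array([-5.0, -3.0, -1.0, 0.0, 1.0, 3.0, 5.0])

# d/dx tanh(x) = 1 - tanh(x)^2
tanh_grad = 1.0 - np.tanh(x) ** 2
# ReLU derivative: 1 for x > 0, 0 otherwise
relu_grad = (x > 0).astype(float)

print("x        :", x)
print("tanh'(x) :", np.round(tanh_grad, 4))  # ~0.0002 at |x| = 5 -> saturated
print("relu'(x) :", relu_grad)               # stays 1 for all positive inputs

# The "careful initialization" above usually means something like Xavier/Glorot
# scaling, which sizes the weights so pre-activations stay in tanh's
# non-saturated zone at the start of training.
```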

But tanh works well with RNNs: since RNNs suffer from vanishing gradients, they need an activation function whose second derivative sustains over a long range before going to zero.
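As an aside (my illustration, not part of the claim above): tanh's bounded output also keeps a recurrent hidden state from growing without bound across time steps. A minimal vanilla-RNN step in NumPy, with arbitrary sizes and initialization:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inp = 16, 8
W_h = rng.normal(0, 1.0 / np.sqrt(hidden), (hidden, hidden))
W_x = rng.normal(0, 1.0 / np.sqrt(inp), (hidden, inp))
b = np.zeros(hidden)

def rnn_step(h_prev, x_t):
    # h_t = tanh(W_h h_{t-1} + W_x x_t + b); the squashing keeps h_t in (-1, 1)
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

h = np.zeros(hidden)
for t in range(100):                       # even over many time steps...
    h = rnn_step(h, rng.normal(size=inp))
print(np.abs(h).max() < 1.0)               # ...the state never blows up: True
```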

If you want any of these properties, use tanh.

Without any details of your architecture, it's hard to say why your model trains faster with tanh than with ReLU. But generally speaking, ReLU trains faster than tanh because it has no expensive operations to evaluate (such as exponentials).
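If you want to see that raw evaluation-cost difference, a rough micro-benchmark like the one below (NumPy, arbitrary sizes; absolute numbers depend on your hardware and math library) typically shows tanh costing noticeably more per element than ReLU:

```python
import timeit
import numpy as np

x = np.random.randn(1_000_000).astype(np.float32)

# ReLU is a single elementwise max; tanh needs exponentials under the hood.
t_relu = timeit.timeit(lambda: np.maximum(x, 0.0), number=200)
t_tanh = timeit.timeit(lambda: np.tanh(x), number=200)

print(f"ReLU: {t_relu:.3f}s  tanh: {t_tanh:.3f}s")  # tanh is typically slower
```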

Leo