I choose the activation function for the output layer based on the kind of output I need and on the properties of the activation functions I know. For example, I choose the sigmoid function when I'm dealing with probabilities, a ReLU when I'm dealing with non-negative values, and a linear function when I'm dealing with unbounded real values.
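To make that concrete, here is a minimal sketch of how I pick the output activation, assuming NumPy. The names `OUTPUT_ACTIVATIONS` and `output_layer` and the keys `"probability"`, `"non_negative"` and `"real"` are just made up for this illustration; they don't come from any library.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1): suitable for probabilities.
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Clamps negatives to 0: output is always >= 0.
    return np.maximum(0.0, z)

def linear(z):
    # Identity: output can be any real number.
    return z

# Hypothetical lookup table (my own naming, not a library API).
OUTPUT_ACTIVATIONS = {
    "probability": sigmoid,
    "non_negative": relu,
    "real": linear,
}

def output_layer(z, target_kind):
    # z holds the pre-activations of the output layer.
    return OUTPUT_ACTIVATIONS[target_kind](z)

z = np.array([-2.0, 0.0, 3.0])
probs = output_layer(z, "probability")
print(probs)
print(output_layer(z, "non_negative"))
print(output_layer(z, "real"))
```

So for the output layer the choice is driven entirely by the range the target lives in, which is why I don't find it hard.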
In hidden layers, I use a leaky ReLU instead of the ReLU to avoid dead neurons, and tanh instead of the sigmoid. Of course, I don't use a linear function in hidden units, since a stack of linear layers collapses into a single linear map.
However, my choice among them for hidden layers is mostly a matter of trial and error.
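For reference, this is roughly what I mean by the hidden-layer choices above, again as a NumPy sketch. The slope `alpha=0.01` for the leaky ReLU is an assumption on my part (a commonly used small value), not something I have tuned.

```python
import numpy as np

def relu_grad(z):
    # Gradient of ReLU: exactly 0 for z <= 0, so a unit whose
    # pre-activation stays negative receives no gradient ("dead" neuron).
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    # alpha is an assumed small slope for the negative side.
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # Gradient is alpha (not 0) for z <= 0, so the unit can still learn.
    return np.where(z > 0, 1.0, alpha)

def sigmoid(z):
    # Output in (0, 1): always positive, not zero-centered.
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -0.5, 0.5, 3.0])

print(relu_grad(z))        # zero gradient on the negative inputs
print(leaky_relu_grad(z))  # small but non-zero gradient there
print(np.tanh(z))          # symmetric around 0
print(sigmoid(z))          # strictly positive
```

This only shows the two properties I rely on (non-zero gradient for negative inputs, and zero-centered outputs for tanh); it doesn't tell me which of the many other activations to prefer, which is what I'm asking about.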
Is there any rule of thumb for which activation function is likely to work well in a given situation?
Take the term "situation" as generally as possible: it could refer to the depth of the layer, the depth of the NN, the number of neurons in that layer, the optimizer chosen, the number of input features of that layer, the application of the NN, etc.
The more activation functions I discover, the more confused I am about which one to use in hidden layers. I don't think that flipping a coin is a good way of choosing an activation function.