The attention module can be seen in many ways. Indeed, you can see it simply as a way to construct a set of weights on the fly (roughly speaking, this is what it does):
$$
A = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)\\
\text{output} = A \cdot V
$$
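To make this concrete, here is a minimal NumPy sketch of the formula above (the function and variable names are mine, not taken from any particular library):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: the weights A are built on the fly from Q and K."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (n, n) similarity scores
    # softmax over each row -> each row of A is a probability distribution
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V                                  # weighted average of the values

# works for any sequence length n
n, d = 7, 16
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
out = attention(Q, K, V)                          # shape (n, d)
```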
Indeed, you can see it as a way to aggregate a sequence independently of its length (a dense layer does not have this property, as it requires a fixed-size input).
Such mechanisms can be found in the literature as "hypernetworks" (you use one network to estimate the weights of another network).
Nothing stops you from using a different activation function... however, you will lose the "interpretation" of that matrix as an attention/probability distribution.
What you gain instead is scalability. Indeed, the $O(n^2)$ complexity comes from the softmax normalization: to normalize each row you must materialize the full $n \times n$ matrix $QK^\top$. However, if you use another activation, such as ReLU, or even the identity, you can reassociate the products as $Q(K^\top V)$ and go back to the beloved $O(n)$ complexity in the sequence length.
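As a sketch of this idea (a minimal version of linear/kernelized attention, assuming a ReLU feature map and a simple sum-based normalizer; names and choices are mine, not a reference implementation):

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0)):
    """Attention without softmax: with an elementwise map phi (here ReLU),
    (phi(Q) @ phi(K).T) @ V can be reassociated as phi(Q) @ (phi(K).T @ V),
    which costs O(n * d^2) instead of O(n^2 * d) and never builds an n x n matrix."""
    KV = phi(K).T @ V                             # (d, d): independent of n
    Z = phi(Q) @ phi(K).sum(axis=0)               # (n,) row-wise normalizer (an assumption, not the softmax)
    return (phi(Q) @ KV) / (Z[:, None] + 1e-6)

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)                   # shape (n, d)
```

The key design choice is simply matrix-product associativity: once the row-wise softmax is gone, nothing forces you to compute $QK^\top$ first, so the cost grows linearly with $n$.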