6

In deep learning, is it possible to use discontinuous activation functions (e.g. one with a jump discontinuity)?

(My guess: ReLU, for example, is non-differentiable at a single point, but it still has a well-defined derivative everywhere else. If an activation function has a jump discontinuity, then its derivative is supposed to contain a delta function at that point. However, backpropagation is incapable of accounting for that delta function, so the gradient-based optimization process will have some problem?)

nbro
Gyeonghoon Ko

2 Answers

5

Even the first artificial neural network, Rosenblatt's perceptron [1], had a discontinuous activation function (a step function). That network appears in the introductory chapters of many AI textbooks; for example, Michael Negnevitsky's Artificial Intelligence: A Guide to Intelligent Systems (2nd ed.) shows how to train such networks on pages 170-174.
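For concreteness, here is a minimal sketch of the perceptron learning rule with a step (i.e. discontinuous) activation, written in Python/NumPy on a toy AND problem. The data and names are my own illustrative assumptions, not taken from [1] or the textbook; the point is that the rule updates the weights from the classification error alone and never needs the derivative of the step function.

```python
import numpy as np

# Minimal sketch of Rosenblatt-style perceptron training with a
# discontinuous (step) activation. The AND toy problem and all names
# here are illustrative assumptions, not taken from [1].

def step(z):
    return np.where(z >= 0.0, 1.0, 0.0)   # jump discontinuity at z = 0

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)    # logical AND targets

w, b, lr = np.zeros(2), 0.0, 0.1
for epoch in range(100):                   # far more than enough for AND
    for xi, ti in zip(X, y):
        err = ti - step(xi @ w + b)        # perceptron rule: uses the error only,
        w += lr * err * xi                 # never the derivative of step()
        b += lr * err

print(step(X @ w + b))                     # -> [0. 0. 0. 1.]
```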

The error backpropagation algorithm can be modified to accommodate discontinuous activation functions; the details are in paper [2]. That paper points out a possible application: training a neural network on microcontrollers. Because multiplying the output $x_j$ of the previous layer by the weight $w_{ij}$ is expensive, the author suggests approximating it with a left shift by $n$ bits (multiplication by $2^n$) for the corresponding $n$, in which case the activation function is discontinuous (a staircase).
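I won't reproduce the exact scheme of [2] here, but the general idea that backprop can be patched at the jumps is often implemented with a surrogate gradient (a straight-through estimator). The staircase activation and the choice of surrogate below are my own illustrative assumptions, sketched in PyTorch:

```python
import torch

# Sketch of backpropagating "through" a discontinuous staircase
# activation by substituting a surrogate gradient (straight-through
# estimator). This is NOT the exact method of [2]; it only illustrates
# how the delta functions in the true derivative can be replaced by a
# usable gradient.

class Staircase(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.floor(x)          # piecewise constant, jumps at every integer

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output             # pretend the slope is 1 everywhere

x = torch.linspace(-3.0, 3.0, 7, requires_grad=True)
loss = Staircase.apply(x).sum()
loss.backward()
print(x.grad)                          # all ones, although the true derivative is 0 a.e.
```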

An example of a neural network with discontinuous activation functions applied to the restoration of degraded images is given in Ref. [3]. Applications of recurrent neural networks with discontinuous activation functions to convex optimization problems are described in Ref. [4]. More examples can probably be found in the literature.

References

  1. Rosenblatt, F. The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev. 1958 Nov; 65(6):386-408. PMID: 13602029 DOI: 10.1037/h0042519
  2. Findlay, D.A. Training networks with discontinuous activation functions. 1989 First IEE International Conference on Artificial Neural Networks, (Conf. Publ. No. 313), London, UK, 1989, pp. 361-363.
  3. Ferreira, L. V.; Kaszkurewicz, E.; Bhaya, A. Image restoration using L1-norm regularization and a gradient-based neural network with discontinuous activation functions. 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, 2008, pp. 2512-2519. DOI: 10.1109/IJCNN.2008.4634149
  4. Liu, Q.; Wang, J. Recurrent Neural Networks with Discontinuous Activation Functions for Convex Optimization. Integration of Swarm Intelligence and Artificial Neural Network, pp. 95-119 (2011) DOI: 10.1142/9789814280150_0004
1

I would say that it is possible, but probably not a very good idea. Like you say, the hard requirement is that the network (and thus its components, including the activation functions) must be differentiable. ReLU isn't differentiable at zero, but you can cheat by defining $f'(0)$ to be 0 (or 1).
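A tiny sketch of that cheat (Python/NumPy; the function names are my own, just for illustration):

```python
import numpy as np

# ReLU and the usual convention for its "derivative" at the kink.
# Choosing 0 (or 1) at exactly zero is the cheat mentioned above; it
# rarely matters because inputs land on exactly zero almost never.

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x, value_at_zero=0.0):
    g = (x > 0).astype(float)          # 1 where x > 0, 0 where x < 0
    g[x == 0.0] = value_at_zero        # arbitrary but fixed choice at x = 0
    return g

x = np.array([-1.0, 0.0, 2.0])
print(relu(x), relu_grad(x))           # [0. 0. 2.] [0. 0. 1.]
```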

A continuous function means that gradient descent leads to some local minimum¹; for piecewise continuous functions, it may not converge (i.e. the breakpoints themselves may not be part of the segment you descend, so you will never get to an actual minimum). This is not likely to be a problem in practice, though.

¹ At least for functions that are bounded from below, like cost functions are.
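To illustrate that caveat, here is a made-up one-dimensional cost whose infimum sits at the jump, so plain gradient descent just hops back and forth across the breakpoint (the function is my own example, not anything specific to neural networks):

```python
# Made-up illustration: the left branch -x never contains its endpoint,
# so gradient descent approaches the jump at x = 0, overshoots onto the
# other branch, and keeps bouncing; the infimum 0 is never attained.

def cost(x):
    return -x if x < 0 else 1.0 + x    # jump at x = 0, bounded below by 0

def grad(x):
    return -1.0 if x < 0 else 1.0      # slope of whichever branch we are on

x, lr = -1.0, 0.1
for _ in range(50):
    x -= lr * grad(x)

print(x, cost(x))                      # x keeps hopping around 0; cost never reaches 0
```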

Ketil Malde