
If recurrent neural networks (RNNs) are used to capture prior information, couldn't the same thing be achieved by a feedforward neural network (FFNN) or multi-layer perceptron (MLP) where the inputs are ordered sequentially?

Here's an example I saw where the top line of each section represents the letters typed and the next row represents the predicted next character (red letters in the next row mean a confident prediction).

[Image: rows of typed characters (top line) with the predicted next character beneath each one; confident predictions shown in red]

Wouldn't it be simpler to just pass the $X$ letters leading up to the last letter into an FFNN?

For example, if $X$ equaled 4, the following might be the input to the FFNN:

S, T, A, C => Prediction: K
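
Here's a rough sketch of what I mean, using PyTorch (the class name, vocabulary size, character indices, and layer sizes are all arbitrary placeholders, not a real implementation):

```python
# Minimal sketch of the fixed-window idea: feed the last X characters
# into an MLP and predict the next one.
import torch
import torch.nn as nn

VOCAB_SIZE = 27   # assumed: 26 letters plus a blank/padding symbol
WINDOW = 4        # X previous characters

class WindowMLP(nn.Module):
    def __init__(self, vocab_size=VOCAB_SIZE, window=WINDOW, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 16)
        self.net = nn.Sequential(
            nn.Linear(window * 16, hidden),  # all X embeddings concatenated
            nn.ReLU(),
            nn.Linear(hidden, vocab_size),   # a score for each possible next char
        )

    def forward(self, x):                    # x: (batch, window) of char indices
        e = self.embed(x).flatten(1)         # (batch, window * 16)
        return self.net(e)                   # (batch, vocab_size) logits

# e.g. indices for "S", "T", "A", "C" (assuming A=0, B=1, ...) -> logits for "K"?
logits = WindowMLP()(torch.tensor([[18, 19, 0, 2]]))
```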
SuperCodeBrah

2 Answers


An RNN or LSTM has the advantage of "remembering" past inputs, which improves performance when predicting time-series data. If you feed, say, the past 500 characters into a feedforward network, it may work, but the network treats the input as a flat collection of features without any explicit indication of time; it can only learn a representation of time through gradient descent. An RNN or LSTM, however, has "time" built into the model as a mechanism: it steps through the sequence one element at a time, so it has a real "sense of time" even before it is trained, and it carries a "memory" of previous data points that helps the prediction. The architecture follows the progression of time, and the gradients are propagated through time as well. This is a much more natural way to process time-series data.
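
To make that concrete, here is a minimal sketch (PyTorch; the class name, vocabulary size, and layer sizes are just assumptions) of a recurrent character model in which the hidden state is carried from one character to the next:

```python
# The LSTM threads a hidden state from one character to the next, so "time"
# is built into the architecture rather than learned from a flat window.
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    def __init__(self, vocab_size=27, embed_dim=16, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x, state=None):         # x: (batch, seq_len) of char indices
        e = self.embed(x)                      # (batch, seq_len, embed_dim)
        out, state = self.lstm(e, state)       # hidden state carried through time
        return self.head(out), state           # next-char logits at every position

# The same weights process every step; the state summarizes all characters seen so far.
logits, state = CharLSTM()(torch.tensor([[18, 19, 0, 2]]))
```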

A 1D CNN will also work for this task. An example of a CNN applied to time-series data is WaveNet, which generates remarkably lifelike speech using dilated convolutions. Whether an LSTM or a CNN works better depends on the data, so you should experiment with both and see which performs best.
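
For comparison, here is a rough sketch of the 1D-CNN alternative, assuming a WaveNet-style stack of dilated causal convolutions (again, the names and sizes are placeholders, not WaveNet itself):

```python
# Stacked dilated convolutions widen the receptive field exponentially,
# so the model can look far back in the sequence without recurrence.
import torch
import torch.nn as nn

class DilatedCharCNN(nn.Module):
    def __init__(self, vocab_size=27, channels=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, channels)
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=2, dilation=d, padding=d)
            for d in (1, 2, 4, 8)              # receptive field grows exponentially
        ])
        self.head = nn.Conv1d(channels, vocab_size, kernel_size=1)

    def forward(self, x):                       # x: (batch, seq_len)
        h = self.embed(x).transpose(1, 2)       # (batch, channels, seq_len)
        for conv in self.convs:
            h = torch.relu(conv(h))[..., :x.size(1)]  # trim so each step is causal
        return self.head(h)                     # (batch, vocab_size, seq_len)

logits = DilatedCharCNN()(torch.tensor([[18, 19, 0, 2, 10]]))
```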

Suppose you need to classify a video's genre. It is much easier to watch it in sequence than to see its frames played in random order in front of your eyes. This is why an RNN or an LSTM works better on time-series data.

Clement

Assumptions

Different model structures encode different assumptions - while we often make simplifying assumptions that aren't strictly correct, some assumptions are more wrong than others.

For example, your proposed structure of "just pass the $X$ letters leading up to the last letter into an FFNN" assumes that all the information relevant for the decision is fully obtainable from the $X$ previous letters, and that the $(X+1)$st and earlier input letters are not relevant: in some sense, an extension of the Markov property. Obviously that's not true in many cases; there are all kinds of structures where long-term relationships matter, and assuming that they don't leads to a model that deliberately ignores such relationships. Furthermore, it makes an independence assumption that the effects of the $X$th, $(X-1)$st and $(X-2)$nd elements on the current output are entirely distinct and separate; you don't assume that those features are related, while in most real problems they are.

The classic RNN structures also make some implicit assumptions: namely, that only the preceding elements are relevant for the decision (which is wrong for some problems, where information from the following items is also required), and that the transformative relationship between the input, the output and the passed-on state is the same for all elements in the chain and doesn't change over time. That's certainly not true in all cases, and it is quite a strong restriction, but it is generally less wrong than the assumption that the last $X$ elements are sufficient; and powerful restrictions that are true (or mostly true) help models generalize better (e.g. the No Free Lunch theorem applies), just like enforcing translational invariance in image-analysis models.
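
As a small illustration of that weight-sharing assumption, here is a tiny sketch (PyTorch, with arbitrary sizes and an assumed A=0 character mapping): a single cell with one set of weights is applied at every position, so the input-to-state transformation cannot change over time.

```python
# One and the same cell (one set of weights) is reused at every time step.
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=16, hidden_size=32)   # a single set of weights
embed = nn.Embedding(27, 16)

chars = torch.tensor([18, 19, 0, 2])               # e.g. "S", "T", "A", "C"
state = torch.zeros(1, 32)
for c in chars:                                    # unrolled over time
    x = embed(c.unsqueeze(0))                      # (1, 16)
    state = cell(x, state)                         # same `cell` applied each step
```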

Peteris