
As far as I can tell, neural networks have a fixed number of neurons in the input layer.

If neural networks are used in a context like NLP, sentences or blocks of text of varying sizes are fed to a network. How is the varying input size reconciled with the fixed size of the input layer of the network? In other words, how is such a network made flexible enough to deal with an input that might be anywhere from one word to multiple pages of text?

If my assumption of a fixed number of input neurons is wrong and new input neurons are added to/removed from the network to match the input size, I don't see how these could ever be trained.

I give the example of NLP, but lots of problems have an inherently unpredictable input size. I'm interested in the general approach for dealing with this.

For images, it's clear you can up/downsample to a fixed size, but, for text, this seems to be an impossible approach since adding/removing text changes the meaning of the original input.

Asciiom

4 Answers


Three possibilities come to mind.

The easiest is zero-padding. Basically, you take a rather big input size and just add zeros if your concrete input is too small. Of course, this is pretty limited and certainly not useful if your input ranges from a few words to full texts.
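As an illustration only, here is a minimal sketch of zero-padding in NumPy (the token ids and the MAX_LEN value are made up):

    # Minimal sketch of zero-padding: pad every sequence of token ids to a fixed length.
    import numpy as np

    MAX_LEN = 10  # the fixed input size of the network (chosen up front)

    def pad(token_ids, max_len=MAX_LEN):
        """Right-pad a sequence of token ids with zeros up to max_len (truncate if longer)."""
        padded = np.zeros(max_len, dtype=np.int64)
        ids = token_ids[:max_len]
        padded[:len(ids)] = ids
        return padded

    print(pad([4, 17, 3]))                            # [ 4 17  3  0  0  0  0  0  0  0]
    print(pad([4, 17, 3, 9, 8, 1, 2, 5, 6, 7, 11]))   # truncated to the first 10 ids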

Recurrent NNs (RNNs) are a very natural choice if you have texts of varying size as input. You feed in words as word vectors (or embeddings) one after another, and the internal state of the RNN is supposed to encode the meaning of the full string of words. This is one of the earlier papers.
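As a sketch of this idea (using PyTorch's GRU as one possible RNN, with made-up dimensions), the same recurrent layer can consume sequences of any length and always returns a fixed-size hidden state:

    # Minimal sketch: an RNN (here a GRU) reads word vectors one after another;
    # the final hidden state is a fixed-size encoding of the variable-length text.
    import torch
    import torch.nn as nn

    embed_dim, hidden_dim = 50, 64
    rnn = nn.GRU(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)

    short_text = torch.randn(1, 3, embed_dim)    # 3 "word vectors"
    long_text = torch.randn(1, 40, embed_dim)    # 40 "word vectors"

    _, h_short = rnn(short_text)
    _, h_long = rnn(long_text)
    print(h_short.shape, h_long.shape)  # both torch.Size([1, 1, 64]) -> fixed-size encodings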

Another possibility is using recursive NNs. This is basically a form of preprocessing in which a text is recursively reduced to a smaller number of word vectors until only one is left - your input, which is supposed to encode the whole text. This makes a lot of sense from a linguistic point of view if your input consists of sentences (which can vary a lot in size), because sentences are structured recursively. For example, the word vector for "the man" should be similar to the word vector for "the man who mistook his wife for a hat", because noun phrases act like nouns, etc. Often, you can use linguistic information to guide your recursion on the sentence. If you want to go way beyond the Wikipedia article, this is probably a good start.
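A minimal sketch of the recursive idea, assuming a single shared composition layer and a simple left-to-right fold instead of a real parse tree:

    # Minimal sketch of a recursive reduction: one shared "composition" layer merges two
    # vectors into one and is applied repeatedly until a single vector encodes the phrase.
    # (A real recursive NN would follow the parse tree; here we just fold left for brevity.)
    import torch
    import torch.nn as nn

    dim = 50
    compose = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())  # shared weights

    def encode(word_vectors):
        """Reduce a list of word vectors to one vector of the same dimension."""
        vec = word_vectors[0]
        for nxt in word_vectors[1:]:
            vec = compose(torch.cat([vec, nxt], dim=-1))
        return vec

    phrase = [torch.randn(dim) for _ in range(6)]  # e.g. "the man who mistook his wife ..."
    print(encode(phrase).shape)  # torch.Size([50]) regardless of phrase length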

BlindKungFuMaster

Others already mentioned:

  • zero padding
  • RNN
  • recursive NN

so I will add another possibility: applying convolutions a different number of times depending on the size of the input. Here is an excellent book that backs up this approach:

Consider a collection of images, where each image has a different width and height. It is unclear how to model such inputs with a weight matrix of fixed size. Convolution is straightforward to apply; the kernel is simply applied a different number of times depending on the size of the input, and the output of the convolution operation scales accordingly.

Taken from page 354, section 9.7 (Data Types), 3rd paragraph. You can read further to see some other approaches.
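To illustrate the quoted idea, here is a minimal sketch (PyTorch, made-up sizes) in which the same Conv1d kernel is applied a different number of times depending on the input length, and global pooling collapses the result to a fixed-size vector:

    # Minimal sketch: the same convolution kernel slides over inputs of different lengths,
    # and a global pooling step turns the variable-length feature map into a fixed-size vector.
    import torch
    import torch.nn as nn

    embed_dim, n_filters = 50, 32
    conv = nn.Conv1d(in_channels=embed_dim, out_channels=n_filters, kernel_size=3)

    def encode(text):                       # text: (batch, embed_dim, seq_len)
        features = torch.relu(conv(text))   # (batch, n_filters, seq_len - 2)
        return features.max(dim=-1).values  # global max pooling -> (batch, n_filters)

    print(encode(torch.randn(1, embed_dim, 7)).shape)    # torch.Size([1, 32])
    print(encode(torch.randn(1, embed_dim, 500)).shape)  # torch.Size([1, 32])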

Salvador Dali

In NLP you have an inherent ordering of the inputs so RNNs are a natural choice.

For variable sized inputs where there is no particular ordering among the inputs, one can design networks which:

  1. use a repetition of the same subnetwork for each of the groups of inputs (i.e. with shared weights). This repeated subnetwork learns a representation of the (groups of) inputs.
  2. use an operation on the representation of the inputs which has the same symmetry as the inputs. For order invariant data, averaging the representations from the input networks is a possible choice.
  3. use an output network to minimize the loss function at the output, based on the combination of the representations of the inputs.

The structure looks as follows:

[figure: network structure]
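A minimal sketch of this structure (layer names and sizes are illustrative, not taken from the answer or the linked article):

    # Minimal sketch of "shared subnetwork per item + symmetric pooling + output network".
    import torch
    import torch.nn as nn

    in_dim, rep_dim = 1, 16
    item_net = nn.Sequential(nn.Linear(in_dim, rep_dim), nn.ReLU())  # shared weights (step 1)
    out_net = nn.Linear(rep_dim, 1)                                  # output network (step 3)

    def predict(values):                  # values: (n_items, in_dim), n_items can vary
        reps = item_net(values)           # one representation per input item
        pooled = reps.mean(dim=0)         # order-invariant combination (step 2)
        return out_net(pooled)

    print(predict(torch.randn(3, in_dim)).shape)    # torch.Size([1])
    print(predict(torch.randn(100, in_dim)).shape)  # torch.Size([1])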

Similar networks have been used to learn the relations between objects (arxiv:1702.05068).

A simple example of how to learn the sample variance of a variable-sized set of values is given here (disclaimer: I'm the author of the linked article).

Andre Holzner

One other big thing that gives these models the ability to deal with varying input sizes, and arguably the most important reason they can, is that the model is independent of the sequence's input and output lengths: no layer should have an input or output dimension tied to the length of the sequence. This is the case in the RNN family of neural networks, which are used for varying-size sequences because their weights depend only on the dimension of each sequence item. We can also see this independence in Transformers, which are likewise used for varying-size sequences, as demonstrated in this repo of multivariate transformers for forecasting (note that such models usually depend only on a maximum input size required by some embeddings, but the network itself is independent of the sequence input and output lengths).
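A minimal sketch of that independence, assuming PyTorch's TransformerEncoderLayer and omitting positional embeddings (learned positional embeddings are what usually impose a maximum length):

    # Minimal sketch: a Transformer encoder layer has no parameter tied to sequence length,
    # so the same layer processes sequences of different lengths unchanged.
    import torch
    import torch.nn as nn

    d_model = 64
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)

    short_seq = torch.randn(1, 5, d_model)
    long_seq = torch.randn(1, 200, d_model)
    print(layer(short_seq).shape)  # torch.Size([1, 5, 64])
    print(layer(long_seq).shape)   # torch.Size([1, 200, 64])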