
Say I have some inputs to my model that are continuous variables. What is a good way of representing them in input neurons so that the network can represent and learn various functions effectively?

A standard approach is to simply re-scale the input so that it has mean=0 and stddev=1.
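
For reference, a minimal sketch of that re-scaling (standardization) in NumPy; the small epsilon guard against constant features is my own addition:

```python
import numpy as np

def standardize(X):
    """Re-scale each column of X to mean 0 and standard deviation 1."""
    mean = X.mean(axis=0)
    std = X.std(axis=0) + 1e-8  # guard against constant features
    return (X - mean) / std
```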

I came up with this alternative scheme.

Use $n$ input neurons, each representing some value $x_1, \ldots, x_n$. When we have a value $x=x_i$ it's represented by setting the value of the corresponding neuron to 1 and others to 0. When $x_i < x < x_{i+1}$, set the two neighboring neurons to a value between 0 and 1 so as to linearly interpolate between the representations for $x_i$ and $x_{i+1}$.
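
For concreteness, here is a small NumPy sketch of this encoding (just an illustration; the choice of grid points and the clipping at the ends are assumptions I made):

```python
import numpy as np

def interp_encode(x, grid):
    """Encode scalar x against sorted grid points x_1 < ... < x_n.

    Returns an n-dimensional vector: one-hot when x equals a grid point,
    a linear interpolation between the two neighboring neurons when
    x_i < x < x_{i+1}, and clipped to the end points outside the grid.
    """
    grid = np.asarray(grid, dtype=float)
    out = np.zeros(len(grid))
    if x <= grid[0]:
        out[0] = 1.0
    elif x >= grid[-1]:
        out[-1] = 1.0
    else:
        i = np.searchsorted(grid, x) - 1          # grid[i] <= x < grid[i+1]
        t = (x - grid[i]) / (grid[i + 1] - grid[i])
        out[i], out[i + 1] = 1.0 - t, t
    return out

print(interp_encode(0.3, np.linspace(0.0, 1.0, 5)))  # ≈ [0. 0.8 0.2 0. 0.]
```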

It seems like this would allow the network to treat different ranges of $x$ differently in a more direct way, and thus learn various non-linear functions of $x$ more easily. It also automatically keeps the inputs in the range $[0, 1]$ without the need for re-scaling.

Does this scheme make sense? Does it have a name?

Keywords like "binning" and "triangular kernels" seem relevant, but I haven't found a reference that actually describes this scheme.

Tomek Czajka

1 Answer


Indeed, your general idea of discretizing real-valued attributes for certain ML models is well known; it is summarized in the survey paper by Kotsiantis et al. (2006), "Discretization Techniques: A Recent Survey":

A discretization algorithm is needed in order to handle problems with real-valued attributes with Decision Trees (DTs), Bayesian Networks (BNs) and Rule-Learners (RLs)... A good discretization algorithm has to balance the loss of information intrinsic to this kind of process and generating a reasonable number of cut points, that is, a reasonable search space.

Your specific idea is very similar to a little-known binning method called soft binning, which has few references. Soft binning is a feature-engineering technique in which continuous variables are transformed into categorical ones, but instead of assigning each data point to a single bin, it assigns probabilities of belonging to multiple bins via kernel functions such as the RBF. This preserves more information from the original continuous variable than hard binning, and it subsumes your scheme as a simplified special case: there is no particular reason to spread a real value over only the two neighboring neurons.
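
A rough sketch of what RBF-based soft binning could look like (the bin centres, the bandwidth $\sigma$, and the normalization to a probability vector are my assumptions):

```python
import numpy as np

def soft_bin_rbf(x, centers, sigma):
    """Return normalized Gaussian (RBF) memberships of scalar x for each bin."""
    centers = np.asarray(centers, dtype=float)
    weights = np.exp(-0.5 * ((x - centers) / sigma) ** 2)
    return weights / weights.sum()  # memberships sum to 1, like probabilities

print(soft_bin_rbf(0.3, centers=np.linspace(0.0, 1.0, 5), sigma=0.25))
```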

Works on learnable histograms have tried to approximate this hard-binning process with a soft-binning version. In the soft-binning version, a sample distributes its maximum vote of 1 across several bins, although it is expected to cast the largest vote in the bin it belongs to.
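
One way such a soft vote can be written is with a triangular membership function whose centres and widths are learnable parameters; this sketch is based on the description above rather than on any specific paper:

```python
import numpy as np

def soft_histogram_votes(x, centers, widths):
    """Triangular votes: 1 at a bin centre, decaying linearly to 0 elsewhere.

    In a learnable histogram, centers and widths would be trainable parameters.
    """
    centers = np.asarray(centers, dtype=float)
    widths = np.asarray(widths, dtype=float)
    return np.maximum(0.0, 1.0 - np.abs(x - centers) * widths)

votes = soft_histogram_votes(0.3, np.linspace(0.0, 1.0, 5), np.full(5, 4.0))
print(votes)  # ≈ [0. 0.8 0.2 0. 0.]
```

Note that with the widths set to the reciprocal of the grid spacing, this reduces to the two-neighbor interpolation scheme from the question.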

In summary, your scheme expands the dimensionality of the input space (risking the curse of dimensionality), which can cause overfitting with limited training data, adds computational cost, and often requires careful tuning of the bin hyperparameters. In contrast, an MLP with normalized real-valued inputs is flexible and can approximate a wide range of functions without explicit feature engineering such as soft binning or RBF networks, which implicitly share similar ideas.

cinch