
The idea that weights determine how much influence each input value from the current layer will have when calculating the input to the following layer reminds me of when my professors would say that our final grade for the course was broken down something like this:

| Item | Percent of Grade |
| --- | --- |
| Midterm Exam 1 | 20% |
| Midterm Exam 2 | 20% |
| Final Exam | 50% |
| Classwork | 5% |
| Homework | 5% |
| Total | 100% |
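Written out, this grading scheme is just a weighted average whose weights sum to 1:

$$\text{Final} = 0.20 \cdot \text{Midterm}_1 + 0.20 \cdot \text{Midterm}_2 + 0.50 \cdot \text{Final Exam} + 0.05 \cdot \text{Classwork} + 0.05 \cdot \text{Homework},$$

with $0.20 + 0.20 + 0.50 + 0.05 + 0.05 = 1$.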

My question, then, is: do all the weights need to add up to 1.0 (i.e. 100%)? If not, why?

nbro
Garrett

4 Answers


No, the weights do not need to add up to one. There isn't really a reason to do that. Weights as "contributions" may not be the best way to think about things here -- you're trying to learn a function that transforms the input values into your desired output values. Say you're learning a linear model, your data is 2D, and it's drawn from a line with slope 2; then you'd want your learned weight to be 2.

Additionally, say you have a feature that is negatively correlated with the output; that is, a higher value in the input results in a lower output value. If all your weights were non-negative, you wouldn't be able to encode this at all!
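As a rough sketch of both points (a learned weight of 2, and a negative weight for a negatively correlated feature), here is a minimal least-squares fit in NumPy; the data-generating coefficients are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two features: the first pushes the output up (true weight 2),
# the second is negatively correlated with it (true weight -3).
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Ordinary least squares: nothing constrains the weights to sum to 1.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)        # roughly [ 2. -3.]
print(w.sum())  # roughly -1, clearly not 1
```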

Finally, as the output would simply be a weighted average of the inputs, the scale of the outputs would be constrained by the scale of the inputs. Consider: $$f(\mathbf{x}) = \sum^d_{i=1} w_i x_i \quad \text{with} \quad \sum^d_{i=1} w_i = 1$$

$f(\mathbf{x})$ will always be less than or equal to $\max(\{x_1, ..., x_d\})$ (assuming the weights are also non-negative, as percentages would be). For a simple linear model, you can definitely think of examples where the scale of the output is much larger than that of the input features (for example, number of bathrooms in a house versus its price). Although you can mitigate some of this by rescaling your data (as I will talk about shortly), there is really no reason to artificially constrain your model like this.
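A quick numeric check of that bound, with made-up numbers: if the weights are non-negative and sum to 1, the output is a convex combination of the inputs, so

$$f(\mathbf{x}) = 0.3 \cdot 2 + 0.7 \cdot 3 = 2.7 \le 3 = \max(x_1, x_2),$$

and no such weighted average of bathroom-count-scale inputs could ever reach house-price-scale outputs.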

A deep neural network can be thought of as a stack of these linear models (with nonlinearities in between), with the output of the previous layer being the input of the next layer. Although these intuitions don't transfer perfectly to deep neural networks, we would still be constraining the expressiveness of the model in a similar way: an increase in an input dimension couldn't result in a decrease in the output value, and the scale of the output would be constrained by the scale of the input.
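As a sketch of that stacking (with entirely made-up weights and ReLU activations), a two-layer network is just these linear maps composed, and nothing forces any layer's weights to sum to 1:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Made-up weights and biases for a 3 -> 4 -> 1 network.
W1 = np.array([[ 1.5, -2.0,  0.3,  4.0],
               [ 0.7,  0.1, -1.2,  2.5],
               [-3.0,  0.9,  0.4, -0.6]])
b1 = np.zeros(4)
W2 = np.array([[2.0], [-1.0], [0.5], [3.0]])
b2 = np.zeros(1)

x = np.array([1.0, 2.0, 3.0])
h = relu(x @ W1 + b1)   # hidden layer: linear map + nonlinearity
y = h @ W2 + b2         # output layer: another linear map
print(y)
```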

Normalization in Neural Networks

You do, however, often need to perform normalization, but this normalization is done on the data and on the outputs of hidden layers, not on the weights. If you add a weight penalty to your linear model, for example, you want to make sure that each dimension is in around the same range, so as not to overly penalize features that tend to need larger weights. In neural networks, you also often normalize the outputs (activations) of hidden layers; the most common variant is Batch Normalization.

However, the normalization often used in these cases is a bit different from what you mentioned. Also known as standardization, this kind of normalization subtracts the mean (over a set of samples) from each sample and divides by the standard deviation (over that same set of samples).
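A minimal sketch of that standardization over a batch, per feature (the small epsilon is a common numerical-stability convention, not something specific to any library):

```python
import numpy as np

def standardize(batch, eps=1e-5):
    """Subtract the per-feature mean and divide by the per-feature std."""
    mean = batch.mean(axis=0)
    std = batch.std(axis=0)
    return (batch - mean) / (std + eps)

batch = np.array([[1.0, 200.0],
                  [2.0, 220.0],
                  [3.0, 180.0]])
print(standardize(batch))  # each column now has ~zero mean and ~unit std
```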

The exact reason why this is effective in neural networks is an active area of research. A common explanation is that batch normalization reduces internal covariate shift, where changes in the weights during training result in changes in the output distribution of a layer, which can harm performance because the next layer "expects" the original distribution. Centering and scaling the inputs to have zero mean and unit standard deviation reduces these changes. This explanation is disputed by this paper, however, which argues that batch normalization helps by making the loss landscape smoother.

Alexander Wan

The weights have nothing to do with percentages! Let me illustrate this using one of the simplest models in machine learning: linear regression. In linear regression, the goal is to find the coefficients of a hyperplane that best fits the given data, represented by the equation $Ax=b$. Here, we have free variables that represent the slopes along each axis and an independent term for the height (which in neural nets is called the bias). Now, to address your question: do the coefficients need to add up to 1? The answer is no.

You might be tempted to equate weights with probabilities, but they are not the same! They are merely coefficients in a high-dimensional, nested function.

[Figure: 3D scatter of data points with the fitted regression plane]

Here's a 3D example of linear regression. The blue points represent the data, and the red plane is the best-fitting hyperplane according to the linear regression model.

The equation of the fitted plane is approximately:

$y = 2.79 \cdot x_1 + 2.01 \cdot x_2 + 4.11$

As you can see, the coefficients are $2.79$ for $x_1$ and $2.01$ for $x_2$, and the intercept is $4.11$. If we sum these together, we get $2.79 + 2.01 + 4.11 = 8.91$, which is clearly not equal to $1$.
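A rough sketch of how such a fit might be produced in NumPy (synthetic data scattered around that plane, so the recovered numbers will differ slightly from the figure):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data near y = 2.79*x1 + 2.01*x2 + 4.11.
X = rng.uniform(0, 10, size=(100, 2))
y = 2.79 * X[:, 0] + 2.01 * X[:, 1] + 4.11 + rng.normal(scale=0.5, size=100)

# Append a column of ones so the intercept (bias) is fitted too.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)        # roughly [2.79, 2.01, 4.11]
print(coef.sum())  # roughly 8.91, not 1
```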

Cesar Ruiz

do all the weights need to add up to 1.0 (i.e. 100%)?

No.

If not, why?

Why should they?

In the case of school grades, the justification might be that each component (midterm exam, homework, etc.) is graded separately on the same scale, and the final grade, which must lie on that same scale, is computed as a weighted average of these components.

There is no such requirement regarding weights in neural networks.

What you might want to enforce is that the outputs in the last layer sum to one, to obtain probabilities of class memberships (this, obviously, applies only to classification problems and not to regression). Softmax is typically used for that purpose.
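A minimal sketch of softmax applied to a vector of raw last-layer outputs (logits); note that it is the outputs, not the weights, that end up summing to 1:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; this does not change the result.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, -0.5])   # made-up raw scores for 3 classes
probs = softmax(logits)
print(probs)        # approximately [0.69, 0.25, 0.06]
print(probs.sum())  # 1.0
```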

Igor F.

No, the weights in that case do not constitute a formal probability distribution and do not need to add up to 1. In fact, if they did, it could cause problems with underflow. Imagine a (nonsensical) neural network: n inputs followed by 10 layers, each containing 10k parameters. This MLP would have 100,000 parameters, excluding biases and the output layer; by the standards of recent architectures, that is not even a large number. Now suppose we normalize all the weights (layer- or network-wise) so that they form a valid probability distribution. It is easy to see that the forward and backward passes could become problematic, as the value of each parameter would be close to 0. Each gradient could also be negligibly small, a phenomenon already observed in neural networks for various reasons (the vanishing gradient problem).
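A rough numeric sketch of that underflow argument (layer sizes made up, no nonlinearities, just matrix products): if each layer's weights are rescaled to sum to 1, every individual weight becomes tiny and the signal shrinks layer after layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten made-up layers of 100x100 non-negative weights.
layers = [np.abs(rng.normal(size=(100, 100))) for _ in range(10)]

# Rescale each layer so that all of its weights sum to 1.
layers = [W / W.sum() for W in layers]

x = np.ones(100)
for W in layers:
    x = x @ W   # plain matrix products, just to show the scale collapsing
print(layers[0].mean())   # ~1e-4: every individual weight is close to zero
print(np.abs(x).max())    # shrinks toward zero after only a few layers
```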

How do we then interpret the activity of neural networks? That is a reasonable question: there is a field of research called Explainable Artificial Intelligence (XAI) that deals with interpreting many deep learning architectures. Given that deep learning essentially builds a sequence of nested functions that can approximate arbitrarily difficult non-linear functions, this is a non-trivial task. If the weights did sum up to 1, it could greatly help that effort. Unfortunately, this is not the case ;)

Scolpe