
While watching Andrew Ng's machine learning course on Coursera, I noticed that, in the logistic regression week, the cost function was a bit more complex than the one for linear regression, but definitely not that hard.

But it got me thinking: why not use the same cost function for logistic regression?

So, the cost function would be $\frac{1}{2m} \sum_{i=1}^m |h(x_i) - y_i|^2$, where $h(x_i)$ is our hypothesis function, $\text{sigmoid}(X\theta)$, $m$ is the number of training examples, and $x_i$ and $y_i$ are our $i$-th training example?
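To make this concrete, here is a rough sketch of the cost I have in mind (NumPy; the helper names and toy data are just for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def mse_cost(theta, X, y):
    """1/(2m) * sum (h(x_i) - y_i)^2, with h = sigmoid(X @ theta)."""
    m = len(y)
    h = sigmoid(X @ theta)            # hypothesis: one probability per example
    return np.sum((h - y) ** 2) / (2 * m)

# toy data: first column of X is all ones (intercept term), y is in {0, 1}
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.3]])
y = np.array([1.0, 0.0, 1.0])
theta = np.zeros(X.shape[1])
print(mse_cost(theta, X, y))          # 0.125: with theta = 0, h = 0.5 everywhere
```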


4 Answers


The mean squared error (MSE), $J(\theta) = \frac{1}{2m}\sum_{i=1}^m(h_\theta(x_i)-y_i)^2$, is less appropriate as a cost function for classification, because the MSE rests on assumptions about the data (essentially, a Gaussian noise model around a continuous target) that do not suit discrete class labels. However, as an optimization objective, it is still possible to minimize the MSE in a classification problem, and thus still learn the parameters $\theta$.

The cross-entropy cost function has better convergence characteristics, as it is more in line with the classification objective.

See the link for the precise mathematical formulation that explains these loss functions from a probabilistic perspective.

Note that the absolute value in your formula is redundant, because $|z|^2 = z^2$ for any real $z$.
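To make the comparison concrete, here is a minimal sketch (NumPy; the helper names are mine) of the MSE cost from the question next to the cross-entropy cost normally used for logistic regression:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_cost(theta, X, y):
    """J(theta) = 1/(2m) * sum (h(x_i) - y_i)^2, with h = sigmoid(X @ theta)."""
    m = len(y)
    h = sigmoid(X @ theta)
    return np.sum((h - y) ** 2) / (2 * m)

def cross_entropy_cost(theta, X, y):
    """J(theta) = -1/m * sum [y_i log h(x_i) + (1 - y_i) log(1 - h(x_i))]."""
    h = sigmoid(X @ theta)
    eps = 1e-12                       # guard against log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
```

Either function can be handed to a generic optimizer, but the cross-entropy objective is the one whose minimizer coincides with the maximum-likelihood estimate under a Bernoulli model of the labels, which is the probabilistic perspective mentioned above.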

I hope this clarifies the matter.


I mean, you technically could (it's not going to break or anything). However, cross-entropy is much better suited for classification, as it penalizes misclassification errors: have a look at the function below; when you are confidently wrong, the loss goes to infinity.

[Plot of the cross-entropy loss, which tends to infinity as the predicted probability of the true class goes to 0.]

In classification you are either in one class or the other. MSE is designed for regression, where there is nuance: getting close to the target is sometimes good enough. You should try both, and you will see that the performance is much better with cross-entropy.
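To see the difference numerically, here is a quick check (plain NumPy, with illustrative numbers of my own choosing) of the per-example losses for a positive example ($y = 1$) as the predicted probability of the positive class shrinks:

```python
import numpy as np

# per-example losses for a positive example (y = 1) as the prediction
# for the positive class becomes more and more confidently wrong
for p in [0.5, 0.1, 0.01, 1e-6]:
    mse = (p - 1.0) ** 2              # bounded above by 1
    xent = -np.log(p)                 # grows without bound as p -> 0
    print(f"p = {p:g}: MSE = {mse:.4f}, cross-entropy = {xent:.4f}")
```

The MSE term can never exceed 1, while the cross-entropy term keeps growing as the prediction gets more confidently wrong.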


Minimising MSE in a classification setting is perfectly reasonable: it is also known as the Brier score, and it is a proper scoring rule, which means it is minimised when the network outputs the conditional probability of class membership. This is not unduly surprising, as minimising the MSE leads to a model that outputs an estimate of the conditional mean of the target distribution, which for a 1-of-c coding is the conditional probability of class membership. You can even use the MSE for training networks with logistic or softmax activation functions in the output layer, so that the outputs obey the usual constraints of lying in the interval $[0,1]$ and summing to one.
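As a minimal sketch of this point (my own toy example, using plain NumPy and gradient descent on the analytic gradient), you can fit a logistic-output model by minimising the Brier score and still obtain outputs that behave like conditional class probabilities:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def brier_score(theta, X, y):
    """MSE between the sigmoid output and the 0/1 labels (the Brier score)."""
    return np.mean((sigmoid(X @ theta) - y) ** 2)

def brier_grad(theta, X, y):
    """Analytic gradient of the Brier score with respect to theta."""
    m = len(y)
    p = sigmoid(X @ theta)
    return (2.0 / m) * X.T @ ((p - y) * p * (1 - p))

# toy 1-D data: class 1 becomes more likely as the feature grows (true slope 2)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (rng.uniform(size=200) < sigmoid(2.0 * x)).astype(float)
X = np.column_stack([np.ones_like(x), x])        # intercept + feature

theta = np.zeros(2)
for _ in range(20000):                           # plain gradient descent
    theta -= 1.0 * brier_grad(theta, X, y)

p = sigmoid(X @ theta)
print(theta)             # intercept near 0, slope clearly positive
print(p.min(), p.max())  # outputs stay strictly inside (0, 1)
print(brier_score(theta, X, y))
```

The fitted outputs are valid probabilities, and with enough data they estimate the conditional probability of class membership, precisely because the Brier score is a proper scoring rule.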

However, the MSE penalises very confident misclassifications much less harshly than the cross-entropy metric does. Whether this is a good or bad thing depends on the needs of the application. If you are mostly interested in the $p = 0.5$ decision boundary, then you probably don't want model resources spent dealing with highly confident misclassifications, which are a long way from the decision boundary and have little effect on it. This is a large part of the justification for purely discriminative methods like the SVM.


This is also asked here.

My response there is specific to the MSE with a sigmoid applied to the input, $(y - \text{sigmoid}(X))^2$, simply to keep the evaluated values within $0$ and $1$.