
I am reading the book: Machine Learning - A First Course for Engineers and Scientists, by Lindholm et al.

Chapter 3, page 50.

Link: http://smlbook.org/book/sml-book-draft-latest.pdf

Consider the logistic regression for classification.

$$\hat{y}(\mathbf{x}) = \begin{cases} 1 & \text{if } g(\mathbf{x}) > r \\ -1 & \text{if } g(\mathbf{x}) \le r \end{cases}$$

The book says "It can be shown that if $g(\mathbf{x})=p(y=1|\mathbf{x})$, that is, the model provides a correct description of the real-world class probabilities, then the choice $r = 0.5$ will give the smallest possible number of misclassifications on average."

I am confused here because if the model provides a correct description of the real-world class probabilities, why do we have misclassifications?

DSPinfinity

2 Answers


Since $g$ is your logistic regression model, its output $g(x)$ for a data sample $x$ is a scalar value between $0$ and $1$ that is usually interpreted as a probability.

  • Essentially, $g(x)=\sigma(W x + b)$, where $W, b$ are the parameters that you learn and $\sigma$ is the sigmoid squashing function (the one that keeps everything in $[0, 1]$); see the short code sketch after this list.
  • Furthermore, $g(x)=p(y=1\mid x)$, meaning that $g(x)$ represents the probability that the sample $x$ belongs to the positive or, as the book says on page 49, most probable class. As a side note, this is something that you can change and usually define before learning.
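
For concreteness, here is a minimal sketch of such a model; the feature dimension and the parameter values are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """Squash a real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def g(x, W, b):
    """Logistic regression output, interpreted as p(y=1 | x)."""
    return sigmoid(W @ x + b)

# Hypothetical learned parameters for a 2-dimensional input.
W = np.array([1.5, -0.8])
b = 0.2

x = np.array([0.4, 1.1])
print(g(x, W, b))  # a value in (0, 1): the estimated p(y=1 | x)
```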

Now, let's assume that the two classes $y=1$ (the positive or most probable) and $y=0$ (the negative or less probable) have real-world probabilities $p(y=1\mid x) = 0.7$ and $p(y=0\mid x) = 1 - p(y=1\mid x) = 0.3$. Let's also be realistic, in the sense that the two class distributions overlap: this implies that you have some misclassification errors, because it is not possible to separate the two classes perfectly. Note: there could be many reasons for this, such as mis-labelling and similarity of data samples between the two classes.

At this point, we have $g(x)=p(y=1\mid x)$, the probability that $x$ belongs to the positive class ($y=1$), implying that there is a $70\%$ chance that $x$ has label $1$. In order to perform a binary classification, you need to get rid of the probabilities by thresholding them, for example $\hat{y}(x) = \mathbf{1}[g(x) > t]$.

  • If $g(x)>t$ you assign label $1$ (so $\hat{y}(x)=1$), otherwise you assign label $0$.
  • In case the model $g$ predicts the real-world class probabilities correctly (i.e., $p(y=1)=0.7$ and $p(y=0)=0.3$), as assumed by the book, choosing the threshold $t=0.5$ (or $r=0.5$ as it is called in the book) minimizes the number of misclassifications, which is equivalent to maximizing the accuracy of the predictions (see the short derivation after this list).
  • In other words, for a threshold of $0.5$ you get maximum accuracy. But in real classification problems you also care about other measures, such as the AUC of ROC/PR curves, FPR and TPR, F1-score, etc. Basically, you can target a metric and tune the threshold $t$ to get the rate of error that best suits your application.
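
Here is a short sketch of the standard argument for why $t=0.5$ is optimal. Consider a single point $x$ with true probability $p = p(y=1\mid x)$. The expected error at $x$ depends only on which label you predict:

$$\mathbb{E}[\text{error at } x] = \begin{cases} 1 - p & \text{if you predict } \hat{y}(x)=1 \\ p & \text{if you predict } \hat{y}(x)=0 \end{cases}$$

Predicting $1$ is the better choice exactly when $1-p < p$, i.e. when $p > 0.5$. Applying this rule at every $x$ is precisely thresholding $g(x)=p(y=1\mid x)$ at $0.5$, and it minimizes the expected number of misclassifications.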

If you still have doubts about your question:

if the model provides a correct description of the real-world class probabilities, why do we have misclassifications?

Consider a classifier $g'(x)=0.7$ that always outputs a probability of $0.7$ regardless of the data point $x$. Even if the true class probability is $0.7$, this model is still wrong $30\%$ of the time: thresholding at $0.5$ makes it predict label $1$ for every input, so it misclassifies all the $x'$ such that $y(x') = 0$.
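
As a quick sanity check, here is a small simulation of that constant classifier (the sample size is made up, purely for illustration): labels are drawn with $p(y=1)=0.7$, the classifier always predicts $1$, and the error rate comes out around $30\%$, which is also the best any classifier can do when every point has the same $0.7$ probability.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                      # made-up sample size, purely illustrative

# True labels: every point has p(y=1 | x) = 0.7.
y = rng.random(n) < 0.7          # True = class 1, False = class 0

# Constant model g'(x) = 0.7; thresholding at 0.5 always predicts class 1.
y_hat = np.ones(n, dtype=bool)

error_rate = np.mean(y_hat != y)
print(f"error rate: {error_rate:.3f}")  # ~0.3: all class-0 points are misclassified
```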

Luca Anzalone

Note that your book specifically claims 'the model provides a correct description of the real-world class probabilities'. This means that the logistic function $g(\mathbf{x})$ trained inside your binary classifier $\hat{y}(\mathbf{x})$, an S-shaped sigmoid function that transforms a linear combination of the input features into a probability between 0 and 1, may happen to match the real-world class probabilities exactly. If this is true, then intuitively $r=0.5$ gives the best overall accuracy and the smallest possible number of misclassifications on average.

In summary, the logistic model $g(\mathbf{x})$ could provide a correct description of the real-world class probabilities, but even in that case the resulting binary classifier $\hat{y}(\mathbf{x})$ can still make misclassifications, since its accuracy may not be 100% on either your training samples or unseen test samples.
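
A standard way to quantify this: even with the optimal threshold $r=0.5$, the average misclassification rate of $\hat{y}(\mathbf{x})$ is

$$\mathbb{E}_{\mathbf{x}}\bigl[\min\{p(y=1\mid \mathbf{x}),\, 1 - p(y=1\mid \mathbf{x})\}\bigr],$$

which is strictly positive whenever there are inputs with $0 < p(y=1\mid \mathbf{x}) < 1$, i.e. whenever the two class distributions overlap. Correct probabilities therefore do not imply zero misclassifications.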

cinch