Since $g$ is your logistic regression model, its output $g(x)$ on a data sample $x$ is a scalar between $0$ and $1$, usually interpreted as a probability.
- We have that $g(x)=\sigma(W x + b)$, where $W,b$ are the parameters you learn and $\sigma$ is the sigmoid squashing function (the one that keeps everything in $[0, 1]$).
- Furthermore, $g(x)=p(y=1\mid x)$, meaning that $g(x)$ is the probability that the sample $x$ belongs to the positive or, as the book says on page 49, most probable class. As a side note, which class counts as the positive one is a convention you can change and usually fix before learning.
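The two bullets above can be sketched in a few lines of NumPy; the parameters $W, b$ below are hypothetical placeholders, not learned values:

```python
import numpy as np

def sigmoid(z):
    # squashes any real-valued score into [0, 1]
    return 1.0 / (1.0 + np.exp(-z))

def g(x, W, b):
    # logistic regression: linear score W x + b followed by the sigmoid
    return sigmoid(W @ x + b)

# hypothetical "learned" parameters for a 2-feature problem
W = np.array([1.5, -0.8])
b = 0.2

x = np.array([0.9, 0.4])
p = g(x, W, b)  # interpreted as p(y=1 | x), always in (0, 1)
print(p)
```

Whatever the score $Wx+b$ is, the sigmoid maps it into $(0,1)$, which is what lets you read $g(x)$ as a probability.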
Now, let's assume that the two classes $y=1$ (the positive or most probable) and $y=0$ (the negative or less probable) have real-world probabilities of: $p(y=1\mid x) = 0.7$ and $p(y=0\mid x) = 1 - p(y=1\mid x) = 0.3$. Let's also be realistic, in the sense that the two class distributions overlap: this implies that you have some misclassification errors, because it is not possible to separate the two classes perfectly. Note: there could be many reasons for this, such as mis-labelling, or similarity of data samples between the two classes.
At this point, we have $g(x)=p(y=1\mid x)$, the probability that $x$ belongs to the positive class ($y=1$), implying that there is a $70\%$ chance that $x$ has label $1$. In order to perform a binary classification, you need to get rid of the probabilities by thresholding them: $\hat{y}(x) = g(x) > t$.
- If $g(x)>t$ you assign label $1$ (so $\hat{y}(x)=1$), otherwise you assign label $0$.
- If the model $g$ predicts the real-world class probabilities correctly (i.e., $p(y=1\mid x)=0.7$ and $p(y=0\mid x)=0.3$), as the book assumes, then choosing the threshold $t=0.5$ (or $r=0.5$ as it is called in the book) makes $g(x)>0.5$ minimize the number of misclassifications, which is equivalent to maximizing the accuracy of the predictions.
- In other words, for a threshold of $0.5$ you get maximum accuracy. But in real classification problems you also care about other measures, such as the AUC of ROC/PR curves, FPR and TPR, F1-score, etc. Basically, you can target a metric and tune the threshold $t$ to get the error trade-off that best suits your application.
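As a small sketch of the thresholding step, here is how accuracy changes with $t$ on some made-up probabilities and labels (the arrays below are purely illustrative, not from the book):

```python
import numpy as np

def predict(probs, t):
    # hard labels from probabilities: y_hat = 1 if g(x) > t else 0
    return (probs > t).astype(int)

# toy model outputs g(x) and true labels y (hypothetical data)
probs  = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.4, 0.3, 0.2])
labels = np.array([1,   1,   1,    0,   1,    0,   0,   0])

for t in (0.3, 0.5, 0.7):
    y_hat = predict(probs, t)
    acc = (y_hat == labels).mean()
    print(f"t={t}: accuracy={acc:.3f}")
```

In practice you would sweep $t$ over a validation set and pick the value that optimizes the metric you care about (accuracy, F1, a target FPR, ...), e.g. with scikit-learn's `roc_curve` or `precision_recall_curve`.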
If you still have doubts about your question:

> if the model provides a correct description of the real-world class probabilities, why do we have misclassifications?
Consider a classifier $g'(x)=0.7$ that always outputs a probability of $0.7$, regardless of the data point $x$. With $t=0.5$ it always predicts label $1$, so even if the true class probability really is $0.7$, this model is still wrong $30\%$ of the time: it misclassifies every $x'$ such that $y(x') = 0$.
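You can verify this with a quick simulation, drawing labels from a world where $p(y=1\mid x)=0.7$ for every $x$ (the setup is hypothetical; the seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# a world where p(y=1 | x) = 0.7 for every x
n = 100_000
y = (rng.random(n) < 0.7).astype(int)

# g'(x) = 0.7 for all x; with threshold t = 0.5 it always predicts label 1
y_hat = np.ones(n, dtype=int)

error_rate = (y_hat != y).mean()
print(error_rate)  # close to 0.3: the model errs exactly on the negatives
```

The error rate converges to $0.3$, i.e. the probability mass of the class the always-positive prediction can never get right.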