
I am implementing a neural network and training it on handwritten digits.

Here is the cost function that I am implementing.

$$J(\Theta)=-\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K}\left[y_{k}^{(i)} \log \left(\left(h_{\Theta}\left(x^{(i)}\right)\right)_{k}\right)+\left(1-y_{k}^{(i)}\right) \log \left(1-\left(h_{\Theta}\left(x^{(i)}\right)\right)_{k}\right)\right]+ \\\frac{\lambda}{2 m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_{l}} \sum_{j=1}^{s_{l+1}}\left(\Theta_{j, i}^{(l)}\right)^{2}$$

In $\log(1-h_{\Theta}(x))$, if $h_{\Theta}(x)$ is $1$, the result is $\log(1-1)=\log(0)$, so I'm getting a math error.

I'm initializing the weights randomly between $10$ and $60$. I'm not sure what I have to change, or where.
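For concreteness, here is a minimal NumPy sketch of the kind of computation I mean (the shapes, seed, and names are illustrative, not my exact code), which reproduces the error:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative shapes: 400 input pixels plus a bias unit, 10 output classes.
rng = np.random.default_rng(0)
Theta = rng.uniform(10, 60, size=(10, 401))    # weights drawn from [10, 60]
x = np.concatenate(([1.0], rng.random(400)))   # bias unit + "pixel" values

h = sigmoid(Theta @ x)      # the weighted sums are in the thousands
print(h)                    # every entry rounds to exactly 1.0 in float64
print(np.log(1 - h))        # log(0) -> -inf, which then poisons the cost
```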


1 Answer


So, firstly, for $h_{\Theta}(x)$ to be exactly $1$, the weighted sum of $x$ (its dot product with $\Theta$) would have to be literally infinite, if you're using the sigmoid function. That doesn't happen in practice, even with the rounding computers do, because we don't use big numbers to initialize our $\Theta$ matrices; initializing the weights between $10$ and $60$, as you describe, is exactly the kind of large initialization that makes the weighted sums big enough for the sigmoid to saturate.

Intuitively, that'd mean you're basically more certain than one can possibly be in this universe that the label of this example should be $1$.
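To put a number on that: in double precision the sigmoid already rounds to exactly $1.0$ once its input is around $37$ or larger, so small initial weights are what keep you away from this regime. A quick check (NumPy, the values are just examples):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(10.0))          # 0.9999546..., still strictly below 1
print(sigmoid(40.0))          # 1.0 -- exp(-40) is too small for float64 to see
print(sigmoid(40.0) == 1.0)   # True, so log(1 - h) would be log(0)
```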

So, if $(1 - h_{\Theta}(x)) = 0$, $y$ is certainly $1$, and so $1-y$ will be zero.

Secondly, the convention is to treat the entire second term, $(1-y_k^{(i)}) \log(1-(h_{\Theta}(x^{(i)}))_k)$, as $0$ whenever $y_k^{(i)}$ is $1$ (i.e. $0 \cdot \log 0$ is taken to be $0$). This will not cause problems when programming, due to the first point I made above.
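In code, a common extra safeguard (a minimal sketch assuming a NumPy implementation with one-hot labels; the names are illustrative) is to clip the predictions away from exactly $0$ and $1$ before taking logs, so the cost stays finite even if an activation saturates numerically:

```python
import numpy as np

def nn_cost(h, Y, Thetas, lam, m):
    """Regularized cross-entropy cost J(Theta).

    h      : (m, K) predictions, h[i, k] = (h_Theta(x^(i)))_k
    Y      : (m, K) one-hot labels, Y[i, k] = y_k^(i)
    Thetas : list of weight matrices Theta^(l); the first column of each
             is assumed to hold the bias weights and is left unregularized
    """
    eps = 1e-12                              # keeps log() away from log(0)
    h = np.clip(h, eps, 1 - eps)
    data = -np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h)) / m
    reg = (lam / (2 * m)) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return data + reg
```

Combined with a small random initialization (drawing the weights from a small interval around zero instead of $[10, 60]$), the clipping should essentially never be triggered.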

Avik Mohan