
My weights go from being between 0 and 1 at initialization to exploding into the tens of thousands in the very next iteration. By the third iteration they become so large that only arrays of NaN values are displayed.

How can I go about fixing this?

Is it due to the unstable nature of the sigmoid function, or is one of my backpropagation equations incorrect, causing my gradients to explode?

```
import numpy as np
from numpy import exp
import matplotlib.pyplot as plt
import h5py

# LOAD DATASET
MNIST_data = h5py.File('data/MNISTdata.hdf5', 'r')
x_train = np.float32(MNIST_data['x_train'][:])
y_train = np.int32(np.array(MNIST_data['y_train'][:,0]))
x_test = np.float32(MNIST_data['x_test'][:])
y_test = np.int32(np.array(MNIST_data['y_test'][:,0]))
MNIST_data.close()

##############################################################################
# PARAMETERS
number_of_digits = 10        # number of outputs
nx = x_test.shape[1]         # number of inputs ... 784 --> 28*28
ny = number_of_digits
m_train = x_train.shape[0]
m_test = x_test.shape[0]
Nh = 30                      # number of hidden layer nodes
alpha = 0.001
iterations = 3
##############################################################################

# ONE HOT ENCODER - encoding y data into 'one hot encoded'
lr = np.arange(number_of_digits)
y_train_one_hot = np.zeros((m_train, number_of_digits))
y_test_one_hot = np.zeros((m_test, number_of_digits))
for i in range(len(y_train_one_hot)):
    y_train_one_hot[i,:] = (lr == y_train[i].astype(np.int))
for i in range(len(y_test_one_hot)):
    y_test_one_hot[i,:] = (lr == y_test[i].astype(np.int))

# VISUALISE SOME DATA
for i in range(5):
    img = x_train[i].reshape((28,28))
    plt.imshow(img, cmap='Greys')
    plt.show()

y_train = np.array([y_train]).T
y_test = np.array([y_test]).T
##############################################################################

# INITIALISE WEIGHTS & BIASES
params = {"W1": np.random.rand(nx, Nh),
          "b1": np.zeros((1, Nh)),
          "W2": np.random.rand(Nh, ny),
          "b2": np.zeros((1, ny))}

# TRAINING

# activation function
def sigmoid(z):
    return 1/(1+exp(-z))

# derivative of activation function
def sigmoid_der(z):
    return z*(1-z)

# softmax function
def softmax(z):
    return 1/sum(exp(z)) * exp(z)

# softmax derivative is alike to sigmoid
def softmax_der(z):
    return sigmoid_der(z)

def cross_entropy_error(v, y):
    return -np.log(v[y])

# forward propagation
def forward_prop(X, y, params):
    outs = {}
    outs['A0'] = X
    outs['Z1'] = np.matmul(outs['A0'], params['W1']) + params['b1']
    outs['A1'] = sigmoid(outs['Z1'])
    outs['Z2'] = np.matmul(outs['A1'], params['W2']) + params['b2']
    outs['A2'] = softmax(outs['Z2'])

    outs['error'] = cross_entropy_error(outs['A2'], y)
    return outs

# back propagation
def back_prop(X, y, params, outs):
    grads = {}
    Eo = (y - outs['A2']) * softmax_der(outs['Z2'])
    Eh = np.matmul(Eo, params['W2'].T) * sigmoid_der(outs['Z1'])
    dW2 = np.matmul(Eo.T, outs['A1']).T
    dW1 = np.matmul(Eh.T, X).T
    db2 = np.sum(Eo, 0)
    db1 = np.sum(Eh, 0)

    grads['dW2'] = dW2
    grads['dW1'] = dW1
    grads['db2'] = db2
    grads['db1'] = db1

    print('dW2:', grads['dW2'])

    return grads

# optimise weights and biases
def optimise(X, y, params, grads):
    params['W2'] -= alpha * grads['dW2']
    params['W1'] -= alpha * grads['dW1']
    params['b2'] -= alpha * grads['db2']
    params['b1'] -= alpha * grads['db1']
    return

# main
for epoch in range(iterations):
    print(epoch)
    outs = forward_prop(x_train, y_train, params)
    grads = back_prop(x_train, y_train, params, outs)
    optimise(x_train, y_train, params, grads)
    loss = 1/ny * np.sum(outs['error'])
    print(loss)
```

hanugm

2 Answers


This problem is called exploding gradients. It leads to an unstable network that at best cannot learn from the training data and at worst ends up with NaN weight values that can no longer be updated.

One way to confirm that it is exploding gradients is to check whether the loss is unstable and not improving, or whether it shows NaN values during training.
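A quick way to check this is to log the gradient norms every iteration and stop as soon as anything goes non-finite. A minimal diagnostic sketch, assuming the `grads` dictionary returned by the question's `back_prop`:

```
import numpy as np

def check_grads(grads, iteration):
    """Print gradient norms and raise if anything is non-finite (diagnostic only)."""
    for name, g in grads.items():
        print(f"iter {iteration}: ||{name}|| = {np.linalg.norm(g):.3e}")
        if not np.isfinite(g).all():
            raise FloatingPointError(f"{name} is NaN/inf at iteration {iteration}")
```

Calling `check_grads(grads, epoch)` right after `back_prop` in the training loop would show how quickly the norms blow up.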

Apart from the usual gradient clipping and weight regularization that are recommended, I think the problem with your network is the architecture.
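For reference, gradient clipping by global norm only takes a few lines of NumPy. This is a generic sketch (the `max_norm` value is arbitrary), applied to the question's `grads` dictionary before the `optimise` step:

```
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients so their combined L2 norm does not exceed max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        grads = {name: g * scale for name, g in grads.items()}
    return grads
```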

30 is an abnormally high number of nodes for a 2-layer perceptron model. Try increasing the number of layers and reducing the number of nodes per layer. This is under the assumption that you're experimenting with MLPs, because for the problem above, convolutional neural networks seem like the obvious way to go. If you haven't explored them yet, definitely check out CNNs for digit recognition; even two-layer models will surely work better there.
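If you do stay with an MLP, a deeper-but-narrower set of parameters in the style of the question's `params` dictionary might look like the sketch below (the layer sizes are arbitrary, chosen only for illustration):

```
import numpy as np

nx, ny = 784, 10                  # input and output sizes from the question
layer_sizes = [nx, 16, 16, ny]    # two narrow hidden layers instead of one wide one

# One weight matrix and bias vector per layer, small scaled random initialisation
params = {}
for i in range(len(layer_sizes) - 1):
    fan_in, fan_out = layer_sizes[i], layer_sizes[i + 1]
    params[f"W{i+1}"] = np.random.randn(fan_in, fan_out) * np.sqrt(1.0 / fan_in)
    params[f"b{i+1}"] = np.zeros((1, fan_out))
```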

Hope this helped!

ashenoy

Try using float64 instead of float32, and int64 instead of int32; increasing the number of bits per value increases the range of weights that can be stored.
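For example, the loading code from the question would change like this (same file and arrays, only the dtypes are widened):

```
import numpy as np
import h5py

MNIST_data = h5py.File('data/MNISTdata.hdf5', 'r')
x_train = np.float64(MNIST_data['x_train'][:])
y_train = np.int64(np.array(MNIST_data['y_train'][:, 0]))
x_test = np.float64(MNIST_data['x_test'][:])
y_test = np.int64(np.array(MNIST_data['y_test'][:, 0]))
MNIST_data.close()
```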

Eesa