Perhaps too general a question, but can anyone explain what would cause a Convolutional Neural Network to diverge?
Specifics:
I am using Tensorflow's iris_training model with some of my own data and keep getting
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback...
tensorflow.contrib.learn.python.learn.monitors.NanLossDuringTrainingError: NaN loss during training.
Traceback originated with line:
tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
hidden_units=[300, 300, 300],
#optimizer=tf.train.ProximalAdagradOptimizer(learning_rate=0.001, l1_regularization_strength=0.00001),
n_classes=11,
model_dir="/tmp/iris_model")
I've tried adjusting the optimizer, using a zero for learning rate, and using no optimizer. Any insights into network layers, data size, etc is appreciated.
This question is related to
python
tensorflow
machine-learning
keras
theano
I'd like to plug in some (shallow) reasons I have experienced as follows:
Hope that helps.
If you're training for cross entropy, you want to add a small number like 1e-8 to your output probability.
Because log(0) is negative infinity, when your model trained enough the output distribution will be very skewed, for instance say I'm doing a 4 class output, in the beginning my probability looks like
0.25 0.25 0.25 0.25
but toward the end the probability will probably look like
1.0 0 0 0
And you take a cross entropy of this distribution everything will explode. The fix is to artifitially add a small number to all the terms to prevent this.
If using integers as targets, makes sure they aren't symmetrical at 0.
I.e., don't use classes -1, 0, 1. Use instead 0, 1, 2.
In my case I got NAN when setting distant integer LABELs. ie:
So, not use a very distant Label.
EDIT You can see the effect in the following simple code:
from keras.models import Sequential
from keras.layers import Dense, Activation
import numpy as np
X=np.random.random(size=(20,5))
y=np.random.randint(0,high=5, size=(20,1))
model = Sequential([
Dense(10, input_dim=X.shape[1]),
Activation('relu'),
Dense(5),
Activation('softmax')
])
model.compile(optimizer = "Adam", loss = "sparse_categorical_crossentropy", metrics = ["accuracy"] )
print('fit model with labels in range 0..5')
history = model.fit(X, y, epochs= 5 )
X = np.vstack( (X, np.random.random(size=(1,5))))
y = np.vstack( ( y, [[8000]]))
print('fit model with labels in range 0..5 plus 8000')
history = model.fit(X, y, epochs= 5 )
The result shows the NANs after adding the label 8000:
fit model with labels in range 0..5
Epoch 1/5
20/20 [==============================] - 0s 25ms/step - loss: 1.8345 - acc: 0.1500
Epoch 2/5
20/20 [==============================] - 0s 150us/step - loss: 1.8312 - acc: 0.1500
Epoch 3/5
20/20 [==============================] - 0s 151us/step - loss: 1.8273 - acc: 0.1500
Epoch 4/5
20/20 [==============================] - 0s 198us/step - loss: 1.8233 - acc: 0.1500
Epoch 5/5
20/20 [==============================] - 0s 151us/step - loss: 1.8192 - acc: 0.1500
fit model with labels in range 0..5 plus 8000
Epoch 1/5
21/21 [==============================] - 0s 142us/step - loss: nan - acc: 0.1429
Epoch 2/5
21/21 [==============================] - 0s 238us/step - loss: nan - acc: 0.2381
Epoch 3/5
21/21 [==============================] - 0s 191us/step - loss: nan - acc: 0.2381
Epoch 4/5
21/21 [==============================] - 0s 191us/step - loss: nan - acc: 0.2381
Epoch 5/5
21/21 [==============================] - 0s 188us/step - loss: nan - acc: 0.2381
If you'd like to gather more information on the error and if the error occurs in the first few iterations, I suggest you run the experiment in CPU-only mode (no GPUs). The error message will be much more specific.
Source: https://github.com/tensorflow/tensor2tensor/issues/574
Although most of the points are already discussed. But I would like to highlight again one more reason for NaN which is missing.
tf.estimator.DNNClassifier(
hidden_units, feature_columns, model_dir=None, n_classes=2, weight_column=None,
label_vocabulary=None, optimizer='Adagrad', activation_fn=tf.nn.relu,
dropout=None, config=None, warm_start_from=None,
loss_reduction=losses_utils.ReductionV2.SUM_OVER_BATCH_SIZE, batch_norm=False
)
By default activation function is "Relu". It could be possible that intermediate layer's generating a negative value and "Relu" convert it into the 0. Which gradually stops training.
I observed the "LeakyRelu" able to solve such problems.
Regularization can help. For a classifier, there is a good case for activity regularization, whether it is binary or a multi-class classifier. For a regressor, kernel regularization might be more appropriate.
The reason for nan
, inf
or -inf
often comes from the fact that division by 0.0
in TensorFlow doesn't result in a division by zero exception. It could result in a nan
, inf
or -inf
"value". In your training data you might have 0.0
and thus in your loss function it could happen that you perform a division by 0.0
.
a = tf.constant([2., 0., -2.])
b = tf.constant([0., 0., 0.])
c = tf.constant([1., 1., 1.])
print((a / b) + c)
Output is the following tensor:
tf.Tensor([ inf nan -inf], shape=(3,), dtype=float32)
Adding a small eplison
(e.g., 1e-5
) often does the trick. Additionally, since TensorFlow 2 the opteration tf.math.division_no_nan
is defined.
Source: Stackoverflow.com