[python] Deep-Learning Nan loss reasons

Perhaps too general a question, but can anyone explain what would cause a Convolutional Neural Network to diverge?

Specifics:

I am using TensorFlow's iris_training model with some of my own data and keep getting

ERROR:tensorflow:Model diverged with loss = NaN.

Traceback...

tensorflow.contrib.learn.python.learn.monitors.NanLossDuringTrainingError: NaN loss during training.

Traceback originated with line:

tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                               hidden_units=[300, 300, 300],
                               # optimizer=tf.train.ProximalAdagradOptimizer(learning_rate=0.001, l1_regularization_strength=0.00001),
                               n_classes=11,
                               model_dir="/tmp/iris_model")

I've tried adjusting the optimizer, using a learning rate of zero, and using no optimizer. Any insights into network layers, data size, etc. are appreciated.


The answer is


I'd like to add some (shallow) reasons I have run into:

  1. We updated our dictionary (for NLP tasks), but the model and the prepared data still used a different one.
  2. We reprocessed our data (binary tf_record) but loaded the old model. The reprocessed data may conflict with the previous data.
  3. We should have trained the model from scratch but forgot to delete the checkpoints, so the model automatically loaded the latest saved parameters.

Hope that helps.


If you're training with cross entropy, you want to add a small number like 1e-8 to your output probability.

Because log(0) is negative infinity, the loss can blow up once the model has trained enough for its output distribution to become very skewed. For instance, say I have a 4-class output. In the beginning my probabilities look like

0.25 0.25 0.25 0.25

but toward the end they will probably look like

1.0 0 0 0

If you take the cross entropy of this distribution, everything will explode. The fix is to artificially add a small number to all the terms to prevent this.
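
As a rough illustration of the epsilon fix, here is a minimal plain-Python sketch. The function name `cross_entropy`, the one-hot target, and the choice of putting the true class where the model predicts 0 are assumptions made for the example; they are not taken from any library.

```python
import math

EPS = 1e-8  # the small constant suggested in the text


def cross_entropy(target, predicted, eps=EPS):
    """Cross entropy -sum(t * log(p + eps)) over two probability lists."""
    # Adding eps keeps the argument of log() strictly above 0.
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


target = [0.0, 1.0, 0.0, 0.0]        # assumed one-hot target: true class is index 1
early = [0.25, 0.25, 0.25, 0.25]     # output distribution early in training
late = [1.0, 0.0, 0.0, 0.0]          # skewed output that puts 0 on the true class

loss_early = cross_entropy(target, early)
loss_late = cross_entropy(target, late)   # large but finite thanks to eps

try:
    cross_entropy(target, late, eps=0.0)
    unprotected_ok = True
except ValueError:
    # Plain Python raises on math.log(0.0); array libraries typically
    # return -inf or nan here instead of raising.
    unprotected_ok = False

print(loss_early, loss_late, unprotected_ok)
```

In a real training loop the same idea is usually applied to the tensor of predicted probabilities before the logarithm is taken.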


If using integers as targets, make sure they aren't symmetrical around 0.

I.e., don't use classes -1, 0, 1. Use instead 0, 1, 2.
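
One way to get from labels such as -1, 0, 1 to 0, 1, 2 is to remap them before training. The sketch below assumes the labels fit in a Python list; `remap_labels` is a hypothetical helper written for this example.

```python
def remap_labels(labels):
    """Map arbitrary integer labels to 0..K-1, keeping their sorted order."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    return [index[c] for c in labels], classes


raw = [-1, 0, 1, 1, -1]
encoded, classes = remap_labels(raw)
print(encoded)   # labels as indices into `classes`
print(classes)   # original label for each index, useful for decoding predictions
```

Keeping `classes` around lets you translate a predicted index back to the original label.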


In my case I got NaN when one integer label was far away from the others, i.e.:

  • With labels [0..100] the training was OK.
  • With labels [0..100] plus one additional label 8000, I got NaNs.

So, do not use a very distant label.

EDIT: You can see the effect in the following simple code:

from keras.models import Sequential
from keras.layers import Dense, Activation
import numpy as np

# 20 random samples with 5 features; integer labels drawn from 0..4
X = np.random.random(size=(20, 5))
y = np.random.randint(0, high=5, size=(20, 1))

# Small classifier with 5 softmax outputs, one per class
model = Sequential([
    Dense(10, input_dim=X.shape[1]),
    Activation('relu'),
    Dense(5),
    Activation('softmax'),
])
model.compile(optimizer="Adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

print('fit model with labels in range 0..5')
history = model.fit(X, y, epochs=5)

# Append one sample whose label (8000) is far outside the other labels
X = np.vstack((X, np.random.random(size=(1, 5))))
y = np.vstack((y, [[8000]]))
print('fit model with labels in range 0..5 plus 8000')
history = model.fit(X, y, epochs=5)

The result shows NaNs after adding the label 8000 (the softmax layer has only 5 units, so the label 8000 has no matching output):

fit model with labels in range 0..5
Epoch 1/5
20/20 [==============================] - 0s 25ms/step - loss: 1.8345 - acc: 0.1500
Epoch 2/5
20/20 [==============================] - 0s 150us/step - loss: 1.8312 - acc: 0.1500
Epoch 3/5
20/20 [==============================] - 0s 151us/step - loss: 1.8273 - acc: 0.1500
Epoch 4/5
20/20 [==============================] - 0s 198us/step - loss: 1.8233 - acc: 0.1500
Epoch 5/5
20/20 [==============================] - 0s 151us/step - loss: 1.8192 - acc: 0.1500
fit model with labels in range 0..5 plus 8000
Epoch 1/5
21/21 [==============================] - 0s 142us/step - loss: nan - acc: 0.1429
Epoch 2/5
21/21 [==============================] - 0s 238us/step - loss: nan - acc: 0.2381
Epoch 3/5
21/21 [==============================] - 0s 191us/step - loss: nan - acc: 0.2381
Epoch 4/5
21/21 [==============================] - 0s 191us/step - loss: nan - acc: 0.2381
Epoch 5/5
21/21 [==============================] - 0s 188us/step - loss: nan - acc: 0.2381
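
A cheap guard against this is to check the labels against the number of output units before calling fit. The sketch below uses plain Python lists; `out_of_range_labels` is a hypothetical helper, and `n_classes=5` is assumed to match the 5-unit softmax layer in the example above.

```python
def out_of_range_labels(labels, n_classes):
    """Return the distinct labels that fall outside [0, n_classes)."""
    return sorted({y for y in labels if not 0 <= y < n_classes})


labels = [0, 3, 4, 1, 8000]
bad = out_of_range_labels(labels, n_classes=5)
if bad:
    print("labels without a matching output unit:", bad)
```

If the check reports anything, either remap the labels or widen the output layer before training.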

If you'd like to gather more information on the error, and the error occurs in the first few iterations, I suggest running the experiment in CPU-only mode (no GPUs). The error message will be much more specific.

Source: https://github.com/tensorflow/tensor2tensor/issues/574


Most of the points have already been discussed, but I would like to highlight one more possible reason for NaN that is missing.

tf.estimator.DNNClassifier(
    hidden_units, feature_columns, model_dir=None, n_classes=2, weight_column=None,
    label_vocabulary=None, optimizer='Adagrad', activation_fn=tf.nn.relu,
    dropout=None, config=None, warm_start_from=None,
    loss_reduction=losses_utils.ReductionV2.SUM_OVER_BATCH_SIZE, batch_norm=False
)

By default the activation function is ReLU. An intermediate layer may generate negative values, which ReLU converts to 0. The gradient for those units is then 0 as well, which gradually stops training.

I have observed that LeakyReLU is able to solve such problems.
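
To make the difference concrete, here is a small plain-Python sketch of the two activations' gradients. The slope `alpha=0.01` and the sample pre-activation values are assumptions chosen for illustration, not defaults of any particular library.

```python
def relu_grad(x):
    """Derivative of max(0, x): zero for every non-positive input."""
    return 1.0 if x > 0 else 0.0


def leaky_relu_grad(x, alpha=0.01):
    """Derivative of leaky ReLU: a small slope alpha for non-positive inputs."""
    return 1.0 if x > 0 else alpha


pre_activations = [-2.0, -0.5, 1.5]
relu_grads = [relu_grad(x) for x in pre_activations]
leaky_grads = [leaky_relu_grad(x) for x in pre_activations]

# With ReLU, units with negative pre-activations receive no gradient at all;
# with leaky ReLU they still receive a small one and can keep learning.
print(relu_grads)
print(leaky_grads)
```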


Regularization can help. For a classifier, there is a good case for activity regularization, whether it is binary or a multi-class classifier. For a regressor, kernel regularization might be more appropriate.


The reason for nan, inf or -inf often comes from the fact that division by 0.0 in TensorFlow doesn't result in a division by zero exception. It could result in a nan, inf or -inf "value". In your training data you might have 0.0 and thus in your loss function it could happen that you perform a division by 0.0.

import tensorflow as tf

a = tf.constant([2., 0., -2.])
b = tf.constant([0., 0., 0.])
c = tf.constant([1., 1., 1.])
print((a / b) + c)

Output is the following tensor:

tf.Tensor([ inf  nan -inf], shape=(3,), dtype=float32)

Adding a small epsilon (e.g., 1e-5) to the denominator often does the trick. Additionally, TensorFlow 2 provides the operation tf.math.divide_no_nan, which returns 0 wherever the denominator is 0.
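
The two workarounds can be sketched in plain Python as follows. Both helper names are hypothetical, and the second only mirrors the "return 0 when the denominator is 0" behaviour of a no-NaN division; it is not TensorFlow's implementation.

```python
import math


def divide_with_epsilon(x, y, eps=1e-5):
    """Element-wise x / (y + eps); assumes the denominators are non-negative."""
    return [a / (b + eps) for a, b in zip(x, y)]


def divide_or_zero(x, y):
    """Element-wise x / y that yields 0.0 wherever the denominator is 0."""
    return [a / b if b != 0 else 0.0 for a, b in zip(x, y)]


a = [2.0, 0.0, -2.0]
b = [0.0, 0.0, 0.0]

with_eps = divide_with_epsilon(a, b)   # large in magnitude, but finite
or_zero = divide_or_zero(a, b)         # zeros instead of inf or nan

print(with_eps)
print(or_zero)
```

The epsilon variant keeps the sign and rough scale of the numerator, while the zero-returning variant discards it, so which one fits depends on what the quotient means in your loss.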

