Deep-Learning NaN loss reasons

Question

Perhaps too general a question, but can anyone explain what would cause a Convolutional Neural Network to diverge?

Specifics: I am using TensorFlow's iris_training model with some of my own data and keep getting

    ERROR:tensorflow:Model diverged with loss = NaN.

    Traceback...

    tensorflow.contrib.learn.python.learn.monitors.NanLossDuringTrainingError: NaN loss during training.

The traceback originated with this line:

    tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                   hidden_units=[300, 300, 300],
                                   optimizer=tf.train.ProximalAdagradOptimizer(learning_rate=0.001, l1_regularization_strength=0.00001),
                                   n_classes=11,
                                   model_dir="/tmp/iris_model")

I've tried adjusting the optimizer, using a zero for the learning rate, and using no optimizer. Any insights into network layers, data size, etc. are appreciated.

User · Answer

If using integers as targets, make sure they aren't symmetrical around 0. I.e., don't use classes -1, 0, 1. Use 0, 1, 2 instead.
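A minimal sketch of that remapping, assuming the labels live in a NumPy integer array. The names `raw_labels` and `remap_labels` are made up for illustration and are not part of any library:

```python
import numpy as np

def remap_labels(raw_labels):
    """Map arbitrary integer class labels onto 0..K-1.

    Returns the remapped labels and the sorted array of original classes,
    so that classes[new_label] recovers the original label.
    """
    classes, remapped = np.unique(np.asarray(raw_labels), return_inverse=True)
    return remapped.reshape(np.shape(raw_labels)), classes

raw_labels = np.array([-1, 0, 1, 1, -1])  # hypothetical targets
labels, classes = remap_labels(raw_labels)
print(labels)   # contiguous class ids starting at 0
print(classes)  # original label for each new id
```

Keeping `classes` around makes it possible to translate predictions back to the original label values.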

User · Answer

Although most of the points have already been discussed, I would like to highlight one more possible reason for NaN that is missing.

    tf.estimator.DNNClassifier(
        hidden_units, feature_columns, model_dir=None, n_classes=2, weight_column=None,
        label_vocabulary=None, optimizer='Adagrad', activation_fn=tf.nn.relu,
        dropout=None, config=None, warm_start_from=None,
        loss_reduction=losses_utils.ReductionV2.SUM_OVER_BATCH_SIZE, batch_norm=False
    )

By default the activation function is "Relu". It is possible that an intermediate layer generates negative values and "Relu" converts them to 0, which gradually stops training. I observed that "LeakyRelu" is able to solve such problems.
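To illustrate the difference this answer points at, here is a small NumPy sketch (not TensorFlow code) comparing the gradients of the two activations for negative pre-activations. The slope `alpha=0.01` and the sample values are assumptions chosen for the example:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Gradient is 1 for positive inputs and 0 otherwise.
    return (x > 0).astype(x.dtype)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # Negative inputs still pass a small gradient of size alpha.
    return np.where(x > 0, 1.0, alpha).astype(x.dtype)

pre_activations = np.array([-3.0, -0.5, 2.0])  # hypothetical layer inputs
print(relu_grad(pre_activations))
print(leaky_relu_grad(pre_activations))
```

With plain ReLU the two negative entries receive no gradient at all, so a unit that only ever sees negative inputs stops updating; the leaky variant keeps a small gradient flowing.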

User · Answer

The reason for nan, inf or -inf often comes from the fact that division by 0.0 in TensorFlow doesn't result in a division-by-zero exception. It can instead result in a nan, inf or -inf "value". Your training data might contain 0.0, and so your loss function could end up dividing by 0.0.

    a = tf.constant([2., 0., -2.])
    b = tf.constant([0., 0., 0.])
    c = tf.constant([1., 1., 1.])
    print((a / b) + c)

Output is the following tensor:

    tf.Tensor([ inf  nan -inf], shape=(3,), dtype=float32)

Adding a small epsilon (e.g., 1e-5) often does the trick. Additionally, since TensorFlow 2 the operation tf.math.divide_no_nan is defined.
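The two remedies can be sketched in plain NumPy. This is an analogue for illustration, not the TensorFlow implementation; the helper names are made up, and `divide_no_nan` here simply mimics the documented behaviour of returning 0 wherever the denominator is 0:

```python
import numpy as np

def divide_with_epsilon(num, den, eps=1e-5):
    # Shifts the denominator slightly so it is never exactly zero
    # (assumes den is non-negative).
    return num / (den + eps)

def divide_no_nan(num, den):
    # Returns 0 wherever the denominator is 0, otherwise num / den.
    num = np.asarray(num, dtype=np.float64)
    den = np.asarray(den, dtype=np.float64)
    out = np.zeros(np.broadcast(num, den).shape, dtype=np.float64)
    np.divide(num, den, out=out, where=den != 0)
    return out

a = np.array([2.0, 0.0, -2.0])
b = np.array([0.0, 0.0, 0.0])
print(divide_with_epsilon(a, b))  # large but finite values
print(divide_no_nan(a, b))        # zeros instead of inf/nan
```

The epsilon variant changes every result slightly, while the masked variant leaves valid divisions untouched, so which one fits depends on the loss being computed.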

User · Answer

If you'd like to gather more information on the error, and if the error occurs in the first few iterations, I suggest you run the experiment in CPU-only mode (no GPUs). The error message will be much more specific.

Source: https://github.com/tensorflow/tensor2tensor/issues/574
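One common way to force CPU-only mode is to hide the GPUs from CUDA through the `CUDA_VISIBLE_DEVICES` environment variable before the framework is imported. A minimal sketch, where `hide_gpus` is a made-up helper name:

```python
import os

def hide_gpus():
    # Must run before the deep-learning framework is imported, because
    # CUDA reads this variable when it initialises.
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

hide_gpus()
# import tensorflow as tf  # import only after the variable is set
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

The same effect can be had without touching the script by setting the variable in the shell that launches it.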

User · Answer

I'd like to add some (shallow) reasons I have experienced:

1. We may have updated our dictionary (for NLP tasks), but the model and the prepared data used a different one.
2. We may have reprocessed our data (binary tf_record), but we loaded the old model. The reprocessed data may conflict with the previous one.
3. We may need to train the model from scratch, but we forgot to delete the checkpoints and the model loaded the latest parameters automatically.

Hope that helps.

User · Answer

In my case I got NaN when setting distant integer labels, i.e.:

- Labels [0..100]: the training was OK.
- Labels [0..100] plus one additional label 8000: then I got NaNs.

So, do not use a very distant label.

EDIT: You can see the effect in the following simple code:

    from keras.models import Sequential
    from keras.layers import Dense, Activation
    import numpy as np

    X = np.random.random(size=(20, 5))
    y = np.random.randint(0, high=5, size=(20, 1))

    model = Sequential([
        Dense(10, input_dim=X.shape[1]),
        Activation('relu'),
        Dense(5),
        Activation('softmax')
    ])
    model.compile(optimizer="Adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

    print('fit model with labels in range 0..5')
    history = model.fit(X, y, epochs=5)

    X = np.vstack((X, np.random.random(size=(1, 5))))
    y = np.vstack((y, [[8000]]))
    print('fit model with labels in range 0..5 plus 8000')
    history = model.fit(X, y, epochs=5)

The result shows the NaNs after adding the label 8000:

    fit model with labels in range 0..5
    Epoch 1/5
    20/20 [==============================] - 0s 25ms/step - loss: 1.8345 - acc: 0.1500
    Epoch 2/5
    20/20 [==============================] - 0s 150us/step - loss: 1.8312 - acc: 0.1500
    Epoch 3/5
    20/20 [==============================] - 0s 151us/step - loss: 1.8273 - acc: 0.1500
    Epoch 4/5
    20/20 [==============================] - 0s 198us/step - loss: 1.8233 - acc: 0.1500
    Epoch 5/5
    20/20 [==============================] - 0s 151us/step - loss: 1.8192 - acc: 0.1500
    fit model with labels in range 0..5 plus 8000
    Epoch 1/5
    21/21 [==============================] - 0s 142us/step - loss: nan - acc: 0.1429
    Epoch 2/5
    21/21 [==============================] - 0s 238us/step - loss: nan - acc: 0.2381
    Epoch 3/5
    21/21 [==============================] - 0s 191us/step - loss: nan - acc: 0.2381
    Epoch 4/5
    21/21 [==============================] - 0s 191us/step - loss: nan - acc: 0.2381
    Epoch 5/5
    21/21 [==============================] - 0s 188us/step - loss: nan - acc: 0.2381
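A likely explanation for the NaN in that experiment is that a sparse categorical cross-entropy loss expects every integer label to index one of the output units, and the label 8000 has no matching unit in a 5-way softmax. A small pre-training check along these lines can catch the problem early; `check_sparse_labels` is a hypothetical helper, not a Keras function:

```python
import numpy as np

def check_sparse_labels(labels, num_classes):
    """Raise if any integer label falls outside [0, num_classes)."""
    labels = np.asarray(labels)
    bad = labels[(labels < 0) | (labels >= num_classes)]
    if bad.size:
        raise ValueError(
            f"labels {np.unique(bad).tolist()} are outside [0, {num_classes})"
        )

num_classes = 5  # size of the softmax output layer in the answer's example
check_sparse_labels(np.array([[0], [4], [2]]), num_classes)  # passes silently

try:
    check_sparse_labels(np.array([[0], [4], [8000]]), num_classes)
except ValueError as err:
    print(err)
```

If the distant label is a real class, the usual options are to remap the labels to a contiguous range or to widen the output layer accordingly.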

User · Answer

There are lots of things I have seen make a model diverge.

1. Too high of a learning rate. You can often tell if this is the case if the loss begins to increase and then diverges to infinity.

2. I am not too familiar with the DNNClassifier, but I am guessing it uses the categorical cross-entropy cost function. This involves taking the log of the prediction, which diverges as the prediction approaches zero. That is why people usually add a small epsilon value to the prediction to prevent this divergence. I am guessing the DNNClassifier probably does this or uses the TensorFlow op for it. Probably not the issue.

3. Other numerical stability issues can exist, such as division by zero, where adding the epsilon can help. Another less obvious one is the square root, whose derivative can diverge if not properly simplified when dealing with finite-precision numbers. Yet again, I doubt this is the issue in the case of the DNNClassifier.

4. You may have an issue with the input data. Try calling assert not np.any(np.isnan(x)) on the input data to make sure you are not introducing the nan. Also make sure all of the target values are valid. Finally, make sure the data is properly normalized. You probably want to have the pixels in the range [-1, 1] and not [0, 255].

5. The labels must be in the domain of the loss function, so if using a logarithmic-based loss function all labels must be non-negative (as noted by evan pu and the comments below).
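The input-data checks from this answer can be sketched as follows. The helper names and the stand-in `images` array are assumptions for illustration; the scaling assumes 8-bit pixel values:

```python
import numpy as np

def check_finite(name, array):
    """Raise if the array contains NaN or +/-inf."""
    array = np.asarray(array, dtype=np.float64)
    if not np.all(np.isfinite(array)):
        raise ValueError(f"{name} contains NaN or inf values")

def scale_pixels(pixels):
    """Map pixel values from [0, 255] to [-1, 1]."""
    return np.asarray(pixels, dtype=np.float32) / 127.5 - 1.0

images = np.array([[0, 128, 255]], dtype=np.uint8)  # stand-in data
x = scale_pixels(images)
check_finite("x", x)
print(x.min(), x.max())
```

Running the same `check_finite` on the targets covers the "make sure all of the target values are valid" part for NaN and inf, though it does not check that the labels are in range for the loss.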

User · Answer

Regularization can help. For a classifier, there is a good case for activity regularization, whether it is a binary or a multi-class classifier. For a regressor, kernel regularization might be more appropriate.

User · Answer

If you're training for cross entropy, you want to add a small number like 1e-8 to your output probability.

Because log(0) is negative infinity, once your model has trained enough the output distribution will be very skewed. For instance, say I'm doing a 4-class output: in the beginning my probability looks like

    0.25 0.25 0.25 0.25

but toward the end the probability will probably look like

    1.0 0 0 0

If you take the cross entropy of this distribution, everything will explode. The fix is to artificially add a small number to all the terms to prevent this.
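A minimal NumPy sketch of the effect, using a hand-written cross-entropy rather than any framework's loss. The `target` vector is an assumption added for the example; the predicted distribution is the skewed one from the answer:

```python
import numpy as np

def cross_entropy(target, predicted, eps=0.0):
    # eps is added to every predicted probability before taking the log.
    return -np.sum(target * np.log(predicted + eps))

target = np.array([0.0, 1.0, 0.0, 0.0])
predicted = np.array([1.0, 0.0, 0.0, 0.0])  # confident and wrong

with np.errstate(divide="ignore", invalid="ignore"):
    unstable = cross_entropy(target, predicted)        # 0 * log(0) gives nan
stable = cross_entropy(target, predicted, eps=1e-8)    # large but finite

print(unstable, stable)
```

Clipping the probabilities into a range such as [eps, 1 - eps] is a common alternative to adding eps, with the same goal of keeping the log argument away from zero.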
