[python] sklearn error ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

I am using sklearn and having a problem with the affinity propagation. I have built an input matrix and I keep getting the following error.

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I have run

np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True

I tried using

mat[np.isfinite(mat) == True] = 0

to remove the infinite values but this did not work either. What can I do to get rid of the infinite values in my matrix, so that I can use the affinity propagation algorithm?

I am using anaconda and python 2.7.9.

This question is related to python python-2.7 scikit-learn valueerror

The answer is


This might happen inside scikit, and it depends on what you're doing. I recommend reading the documentation for the functions you're using. You might be using one which depends e.g. on your matrix being positive definite and not fulfilling that criteria.

EDIT: How could I miss that:

np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True

is obviously wrong. Right would be:

np.any(np.isnan(mat))

and

np.all(np.isfinite(mat))

You want to check wheter any of the element is NaN, and not whether the return value of the any function is a number...


In most cases getting rid of infinite and null values solve this problem.

get rid of infinite values.

df.replace([np.inf, -np.inf], np.nan, inplace=True)

get rid of null values the way you like, specific value such as 999, mean, or create your own function to impute missing values

df.fillna(999, inplace=True)

I would like to propose a solution for numpy that worked well for me. The line

from numpy import inf
inputArray[inputArray == inf] = np.finfo(np.float64).max

substitues all infite values of a numpy array with the maximum float64 number.


I got the same error message when using sklearn with pandas. My solution is to reset the index of my dataframe df before running any sklearn code:

df = df.reset_index()

I encountered this issue many times when I removed some entries in my df, such as

df = df[df.label=='desired_one']

I had the error after trying to select a subset of rows:

df = df.reindex(index=my_index)

Turns out that my_index contained values that were not contained in df.index, so the reindex function inserted some new rows and filled them with nan.


I had the same error, and in my case X and y were dataframes so I had to convert them to matrices first:

X = X.values.astype(np.float)
y = y.values.astype(np.float)

Edit: The originally suggested X.as_matrix() is Deprecated


Remove all infinite values:

(and replace with min or max for that column)

import numpy as np

# generate example matrix
matrix = np.random.rand(5,5)
matrix[0,:] = np.inf
matrix[2,:] = -np.inf
>>> matrix
array([[       inf,        inf,        inf,        inf,        inf],
       [0.87362809, 0.28321499, 0.7427659 , 0.37570528, 0.35783064],
       [      -inf,       -inf,       -inf,       -inf,       -inf],
       [0.72877665, 0.06580068, 0.95222639, 0.00833664, 0.68779902],
       [0.90272002, 0.37357483, 0.92952479, 0.072105  , 0.20837798]])

# find min and max values for each column, ignoring nan, -inf, and inf
mins = [np.nanmin(matrix[:, i][matrix[:, i] != -np.inf]) for i in range(matrix.shape[1])]
maxs = [np.nanmax(matrix[:, i][matrix[:, i] != np.inf]) for i in range(matrix.shape[1])]

# go through matrix one column at a time and replace  + and -infinity 
# with the max or min for that column
for i in range(matrix.shape[1]):
    matrix[:, i][matrix[:, i] == -np.inf] = mins[i]
    matrix[:, i][matrix[:, i] == np.inf] = maxs[i]

>>> matrix
array([[0.90272002, 0.37357483, 0.95222639, 0.37570528, 0.68779902],
       [0.87362809, 0.28321499, 0.7427659 , 0.37570528, 0.35783064],
       [0.72877665, 0.06580068, 0.7427659 , 0.00833664, 0.20837798],
       [0.72877665, 0.06580068, 0.95222639, 0.00833664, 0.68779902],
       [0.90272002, 0.37357483, 0.92952479, 0.072105  , 0.20837798]])

None of the answers here worked for me. This was what worked.

Test_y = np.nan_to_num(Test_y)

It replaces the infinity values with high finite values and the nan values with numbers


In my case the problem was that many scikit functions return numpy arrays, which are devoid of pandas index. So there was an index mismatch when I used those numpy arrays to build new DataFrames and then I tried to mix them with the original data.


try

mat.sum()

If the sum of your data is infinity (greater that the max float value which is 3.402823e+38) you will get that error.

see the _assert_all_finite function in validation.py from the scikit source code:

if is_float and np.isfinite(X.sum()):
    pass
elif is_float:
    msg_err = "Input contains {} or a value too large for {!r}."
    if (allow_nan and np.isinf(X).any() or
            not allow_nan and not np.isfinite(X).all()):
        type_err = 'infinity' if allow_nan else 'NaN, infinity'
        # print(X.sum())
        raise ValueError(msg_err.format(type_err, X.dtype))

dataset = dataset.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

This worked for me


i got the same error. it worked with df.fillna(-99999, inplace=True) before doing any replacement, substitution etc


This is the check on which it fails:

Which says

def _assert_all_finite(X):
    """Like assert_all_finite, but only for ndarray."""
    X = np.asanyarray(X)
    # First try an O(n) time, O(1) space solution for the common case that
    # everything is finite; fall back to O(n) space np.isfinite to prevent
    # false positives from overflow in sum method.
    if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
            and not np.isfinite(X).all()):
        raise ValueError("Input contains NaN, infinity"
                         " or a value too large for %r." % X.dtype)

So make sure that you have non NaN values in your input. And all those values are actually float values. None of the values should be Inf either.


This is my function (based on this) to clean the dataset of nan, Inf, and missing cells (for skewed datasets):

import pandas as pd

def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(1)
    return df[indices_to_keep].astype(np.float64)

With this version of python 3:

/opt/anaconda3/bin/python --version
Python 3.6.0 :: Anaconda 4.3.0 (64-bit)

Looking at the details of the error, I found the lines of codes causing the failure:

/opt/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
     56             and not np.isfinite(X).all()):
     57         raise ValueError("Input contains NaN, infinity"
---> 58                          " or a value too large for %r." % X.dtype)
     59 
     60 

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

From this, I was able to extract the correct way to test what was going on with my data using the same test which fails given by the error message: np.isfinite(X)

Then with a quick and dirty loop, I was able to find that my data indeed contains nans:

print(p[:,0].shape)
index = 0
for i in p[:,0]:
    if not np.isfinite(i):
        print(index, i)
    index +=1

(367340,)
4454 nan
6940 nan
10868 nan
12753 nan
14855 nan
15678 nan
24954 nan
30251 nan
31108 nan
51455 nan
59055 nan
...

Now all I have to do is remove the values at these indexes.


The Dimensions of my input array were skewed, as my input csv had empty spaces.


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to python-2.7

Numpy, multiply array with scalar Not able to install Python packages [SSL: TLSV1_ALERT_PROTOCOL_VERSION] How to create a new text file using Python Could not find a version that satisfies the requirement tensorflow Python: Pandas pd.read_excel giving ImportError: Install xlrd >= 0.9.0 for Excel support Display/Print one column from a DataFrame of Series in Pandas How to calculate 1st and 3rd quartiles? How can I read pdf in python? How to completely uninstall python 2.7.13 on Ubuntu 16.04 Check key exist in python dict

Examples related to scikit-learn

LabelEncoder: TypeError: '>' not supported between instances of 'float' and 'str' UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples scikit-learn random state in splitting dataset LogisticRegression: Unknown label type: 'continuous' using sklearn in python Can anyone explain me StandardScaler? ImportError: No module named model_selection How to split data into 3 sets (train, validation and test)? How to convert a Scikit-learn dataset to a Pandas dataset? Accuracy Score ValueError: Can't Handle mix of binary and continuous target How can I plot a confusion matrix?

Examples related to valueerror

sklearn error ValueError: Input contains NaN, infinity or a value too large for dtype('float64') ValueError: unsupported pickle protocol: 3, python2 pickle can not load the file dumped by python 3 pickle? ValueError: shape mismatch: objects cannot be broadcast to a single shape ValueError: object too deep for desired array while using convolution