How to convert a Scikit-learn dataset to a Pandas dataset

Question

How do I convert data from a Scikit-learn Bunch object to a Pandas DataFrame?

    from sklearn.datasets import load_iris
    import pandas as pd
    data = load_iris()
    print(type(data))
    data1 = pd.  # Is there a Pandas method to accomplish this?

User · Answer

Working off the best answer and addressing my comment, here is a function for the conversion:

    import numpy as np
    import pandas as pd

    def bunch_to_dataframe(bunch):
        fnames = bunch.feature_names
        features = fnames.tolist() if isinstance(fnames, np.ndarray) else fnames
        # Build a new list so the Bunch's own feature_names is not modified.
        features = features + ['target']
        return pd.DataFrame(data=np.c_[bunch['data'], bunch['target']],
                            columns=features)

User · Answer

This works for me:

    dataFrame = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                             columns=list(iris['feature_names']) + ['target'])

User · Answer

    from sklearn.datasets import load_iris
    import pandas as pd

    iris_dataset = load_iris()

    datasets = pd.DataFrame(iris_dataset['data'],
                            columns=iris_dataset['feature_names'])
    target_val = pd.Series(iris_dataset['target'], name='target_values')

    species = []
    for val in target_val:
        if val == 0:
            species.append('iris-setosa')
        if val == 1:
            species.append('iris-versicolor')
        if val == 2:
            species.append('iris-virginica')
    species = pd.Series(species)

    datasets['target'] = target_val
    datasets['target_name'] = species
    datasets.head()

User · Answer

Here's another integrated method example that may be helpful:

    from sklearn.datasets import load_iris
    iris_X, iris_y = load_iris(return_X_y=True, as_frame=True)
    type(iris_X), type(iris_y)

The data iris_X are imported as a pandas DataFrame and the target iris_y is imported as a pandas Series.
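If a single table is wanted, the two returned objects can be joined column-wise. A minimal sketch, assuming scikit-learn 0.23 or later (where as_frame is available); the name iris_df is only illustrative:

```python
import pandas as pd
from sklearn.datasets import load_iris

# as_frame=True makes X a DataFrame and y a Series.
iris_X, iris_y = load_iris(return_X_y=True, as_frame=True)

# Concatenate column-wise; both objects share the same default index.
# rename() makes the name of the new column explicit.
iris_df = pd.concat([iris_X, iris_y.rename("target")], axis=1)

print(iris_df.shape)
print(iris_df.columns.tolist())
```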

User · Answer

Another way to combine the features and the target variable is np.column_stack:

    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_iris

    data = load_iris()
    df = pd.DataFrame(np.column_stack((data.data, data.target)),
                      columns=data.feature_names + ['target'])
    print(df.head())

Result:

       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
    0                5.1               3.5                1.4               0.2     0.0
    1                4.9               3.0                1.4               0.2     0.0
    2                4.7               3.2                1.3               0.2     0.0
    3                4.6               3.1                1.5               0.2     0.0
    4                5.0               3.6                1.4               0.2     0.0

If you need the string label for the target, you can use replace by converting target_names to a dictionary, and add a new column:

    df['label'] = df.target.replace(dict(enumerate(data.target_names)))
    print(df.head())

Result:

       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target   label
    0                5.1               3.5                1.4               0.2     0.0  setosa
    1                4.9               3.0                1.4               0.2     0.0  setosa
    2                4.7               3.2                1.3               0.2     0.0  setosa
    3                4.6               3.1                1.5               0.2     0.0  setosa
    4                5.0               3.6                1.4               0.2     0.0  setosa
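As an alternative to replace, the label column can also be built by indexing target_names with the integer targets. A small sketch, assuming the targets are integer class codes, as they are for iris:

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# target_names is a NumPy array, so the integer codes can index it directly.
df["label"] = data.target_names[data.target]

print(df["label"].unique())
```

This keeps the target column as integers, which avoids looking up float keys such as 0.0 in the dictionary.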

User · Answer

The API is a little cleaner than the other responses suggest. Here, using as_frame and being sure to include a response column as well:

    import pandas as pd
    from sklearn.datasets import load_wine

    features, target = load_wine(as_frame=True).data, load_wine(as_frame=True).target
    df = features
    df['target'] = target

    df.head(2)

User · Answer

This easy method worked for me:

    boston = load_boston()
    boston_frame = pd.DataFrame(data=boston.data, columns=boston.feature_names)
    boston_frame["target"] = boston.target

It can be applied to load_iris as well.

User · Answer

What TomDLT answered may not work for some of you, because in

    data1 = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                         columns=iris['feature_names'] + ['target'])

iris['feature_names'] returns a numpy array. With a numpy array you can't append a list (['target']) using just the + operator. Hence you need to convert it into a list first and then add.

You can do:

    data1 = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                         columns=list(iris['feature_names']) + ['target'])

This will work fine.
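The point about + can be checked without loading a dataset. This sketch uses made-up feature names and shows that wrapping the names in list(...) gives the same column list whether they arrive as a list or as a NumPy array:

```python
import numpy as np

names_as_list = ["a", "b"]
names_as_array = np.array(["a", "b"])

# list + list is concatenation, so this appends one column name.
cols_from_list = list(names_as_list) + ["target"]

# With an ndarray, + is an element-wise operation, not concatenation,
# so convert to a list first.
cols_from_array = list(names_as_array) + ["target"]

print(len(cols_from_list), len(cols_from_array))
```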

User · Answer

    import pandas as pd
    from sklearn.datasets import load_iris

    iris = load_iris()
    X = iris['data']
    y = iris['target']
    iris_df = pd.DataFrame(X, columns=iris['feature_names'])
    iris_df.head()

User · Answer

I took a couple of ideas from your answers and I don't know how to make it shorter :)

    import pandas as pd
    from sklearn.datasets import load_iris

    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris['feature_names'])
    df['target'] = iris['target']

This gives a Pandas DataFrame with feature_names plus target as columns and RangeIndex(start=0, stop=len(df), step=1). I would like to have shorter code where I can have 'target' added directly.
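One way to add 'target' in the same expression is DataFrame.assign, which returns a new frame with the extra column. A sketch of that idea:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# assign() adds the target column without a separate statement.
df = pd.DataFrame(iris.data, columns=iris.feature_names).assign(target=iris.target)

print(df.head())
```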

User · Answer

    from sklearn.datasets import load_iris
    import pandas as pd

    data = load_iris()
    df = pd.DataFrame(data.data, columns=data.feature_names)
    df.head()

This tutorial may be of interest: http://www.neural.cz/dataset-exploration-boston-house-pricing.html

User · Answer

TomDLT's solution is not generic enough for all the datasets in scikit-learn. For example, it does not work for the Boston housing dataset. I propose a different solution which is more universal. There is no need to use numpy either.

    from sklearn import datasets
    import pandas as pd

    boston_data = datasets.load_boston()
    df_boston = pd.DataFrame(boston_data.data, columns=boston_data.feature_names)
    df_boston['target'] = pd.Series(boston_data.target)
    df_boston.head()

As a general function:

    def sklearn_to_df(sklearn_dataset):
        df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
        df['target'] = pd.Series(sklearn_dataset.target)
        return df

    df_boston = sklearn_to_df(datasets.load_boston())

User · Answer

Update: 2020

You can use the parameter as_frame=True to get pandas dataframes.

If the as_frame parameter is available (e.g. load_iris):

    from sklearn import datasets
    X, y = datasets.load_iris(return_X_y=True)  # numpy arrays

    dic_data = datasets.load_iris(as_frame=True)
    print(dic_data.keys())

    df = dic_data['frame']       # pandas dataframe: data + target
    df_X = dic_data['data']      # pandas dataframe: data only
    ser_y = dic_data['target']   # pandas series: target only
    dic_data['target_names']     # numpy array

If the as_frame parameter is NOT available (e.g. load_boston):

    import pandas as pd
    from sklearn import datasets

    fnames = [i for i in dir(datasets) if 'load_' in i]
    print(fnames)

    fname = 'load_boston'
    loader = getattr(datasets, fname)()
    df = pd.DataFrame(loader['data'], columns=loader['feature_names'])
    df['target'] = loader['target']
    df.head(2)
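The two cases can be folded into one helper. This is a sketch only: dataset_to_frame is a hypothetical name, and it assumes the loader returns a Bunch with data, target and feature_names, and that the target is one-dimensional.

```python
import inspect

import pandas as pd
from sklearn import datasets


def dataset_to_frame(loader):
    """Return features plus the target as one DataFrame (hypothetical helper)."""
    if "as_frame" in inspect.signature(loader).parameters:
        # Loaders that accept as_frame can build the combined frame themselves.
        return loader(as_frame=True)["frame"]
    # Fallback: assemble the frame by hand from the Bunch fields.
    bunch = loader()
    df = pd.DataFrame(bunch["data"], columns=list(bunch["feature_names"]))
    df["target"] = bunch["target"]
    return df


df = dataset_to_frame(datasets.load_iris)
print(df.shape)
```

Checking the signature, rather than catching a TypeError, keeps unrelated errors raised inside the loader visible.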

User · Answer

Just as an alternative that I could wrap my head around much more easily:

    data = load_iris()
    df = pd.DataFrame(data['data'], columns=data['feature_names'])
    df['target'] = data['target']
    df.head()

Basically, instead of concatenating from the get-go, just make a data frame with the matrix of features, then add the target column with data['whatevername'] and grab the target values from the dataset.

User · Answer

You can use the pd.DataFrame constructor, giving a numpy array (data) and a list of the names of the columns (columns). To have everything in one DataFrame, you can concatenate the features and the target into one numpy array with np.c_[...] (note the square brackets and not parentheses). Also, you can have some trouble if you don't convert the feature names (iris['feature_names']) to a list before concatenation:

    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_iris

    iris = load_iris()

    df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                      columns=list(iris['feature_names']) + ['target'])

User · Answer

As of version 0.23, you can directly return a DataFrame using the as_frame argument. For example, loading the iris data set:

    from sklearn.datasets import load_iris
    iris = load_iris(as_frame=True)
    df = iris.data

As far as I understand from the provisional release notes, this works for the breast_cancer, diabetes, digits, iris, linnerud, wine and California housing data sets.

User · Answer

Basically what you need is the "data", and you have it in the scikit-learn Bunch. You also need the "target" (prediction), which is in the Bunch as well. So you just need to concatenate these two to make the data complete (here cancer is the loaded Bunch, for example from load_breast_cancer()):

    data_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
    target_df = pd.DataFrame(cancer.target, columns=['target'])

    final_df = data_df.join(target_df)

User · Answer

Otherwise, use the seaborn data sets, which are actual pandas data frames:

    import seaborn
    iris = seaborn.load_dataset("iris")
    type(iris)
    # <class 'pandas.core.frame.DataFrame'>

Compare with the scikit-learn data sets:

    from sklearn import datasets
    iris = datasets.load_iris()
    type(iris)
    # <class 'sklearn.utils.Bunch'>
    dir(iris)
    # ['DESCR', 'data', 'feature_names', 'filename', 'target', 'target_names']

User · Answer

One of the best ways:

    data = pd.DataFrame(digits.data)

Here digits is the sklearn dataset, and I converted its data to a pandas DataFrame.

User · Answer

Took me 2 hours to figure this out:

    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_iris

    iris = load_iris()
    ## iris.keys()

    df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                      columns=iris['feature_names'] + ['target'])

    df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

This gets back the species for my pandas DataFrame.

User · Answer

Manually, you can use the pd.DataFrame constructor, giving a numpy array (data) and a list of the names of the columns (columns). To have everything in one DataFrame, you can concatenate the features and the target into one numpy array with np.c_[...] (note the []):

    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_iris

    # save load_iris() sklearn dataset to iris
    # if you'd like to check dataset type use: type(load_iris())
    # if you'd like to view list of attributes use: dir(load_iris())
    iris = load_iris()

    # np.c_ is the numpy concatenate function
    # which is used to concat iris['data'] and iris['target'] arrays
    # for pandas column argument: concat iris['feature_names'] list
    # and string list (in this case one string); you can make this anything you'd like
    # the original dataset would probably call this ['Species']
    data1 = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                         columns=iris['feature_names'] + ['target'])

User · Answer

There might be a better way, but here is what I have done in the past and it works quite well:

    items = list(data.items())                     # Gets all the data from this Bunch - a huge list
    mydata = pd.DataFrame(items[1][1])             # Gets the attributes
    mydata[len(mydata.columns)] = items[2][1]      # Adds a column for the target variable
    mydata.columns = items[-1][1] + [items[2][0]]  # Gets the column names and updates the dataframe

Now mydata will have everything you need: attributes, target variable and column names. Note that the positional indices depend on the order of the keys in the Bunch, so check data.keys() first.

User · Answer

This snippet is only syntactic sugar built upon what TomDLT and rolyat have already contributed and explained. The only differences are that load_iris returns a tuple instead of a dictionary, and that the column names are enumerated:

    df = pd.DataFrame(np.c_[load_iris(return_X_y=True)])
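Because the one-liner leaves the columns labelled with integers, names can be attached afterwards. A sketch, assuming the iris loader:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# np.c_ receives the (X, y) tuple and stacks y as the last column.
df = pd.DataFrame(np.c_[load_iris(return_X_y=True)])
print(df.columns.tolist())  # integer column labels

# Attach readable names after the fact.
df.columns = list(load_iris().feature_names) + ["target"]
print(df.columns.tolist())
```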
