How to use numpy.genfromtxt when first column is string and the remaining columns are numbers?

Question

Basically, I have a bunch of data where the first column is a string (label) and the remaining columns are numeric values. I run the following:

    data = numpy.genfromtxt('data.txt', delimiter=',')

This reads most of the data well, but the label column just gets 'nan'. How can I deal with this?

User · Answer

You can use numpy.recfromcsv(filename): the types of each column will be automatically determined (as if you use np.genfromtxt() with dtype=None), and by default delimiter=",". It's basically a shortcut for np.genfromtxt(filename, delimiter=",", dtype=None) that Pierre GM pointed at in his answer.
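As a rough illustration of the equivalent genfromtxt call, the sketch below reads a small in-memory sample instead of a file. The sample rows and the encoding="utf-8" argument are assumptions added here, not part of the answer; the encoding is passed so that the label column comes back as text rather than bytes.

```python
import io
import numpy as np

# Hypothetical sample standing in for the file on disk.
sample = io.StringIO("row1,1,10,100\nrow2,2,20,200\nrow3,3,30,300\n")

# dtype=None asks genfromtxt to infer a type for each column.
table = np.genfromtxt(sample, delimiter=",", dtype=None, encoding="utf-8")

labels = table["f0"]         # the inferred string column
first_numeric = table["f1"]  # the first inferred integer column
```

Since no header row is read here, the fields get the default names f0, f1, f2 and so on.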

User · Answer

If your data file is structured like this

    col1, col2, col3
       1,    2,    3
      10,   20,   30
     100,  200,  300

then numpy.genfromtxt can interpret the first line as column headers using the names=True option. With this you can access the data very conveniently by providing the column header:

    data = np.genfromtxt('data.txt', delimiter=',', names=True)
    print(data['col1'])    # array([   1.,   10.,  100.])
    print(data['col2'])    # array([   2.,   20.,  200.])
    print(data['col3'])    # array([   3.,   30.,  300.])

Since in your case the data is formed like this

    row1,   1,  10, 100
    row2,   2,  20, 200
    row3,   3,  30, 300

you can achieve something similar using the following code snippet:

    labels = np.genfromtxt('data.txt', delimiter=',', usecols=0, dtype=str)
    raw_data = np.genfromtxt('data.txt', delimiter=',')[:, 1:]
    data = {label: row for label, row in zip(labels, raw_data)}

The first line reads the first column (the labels) into an array of strings. The second line reads all data from the file but discards the first column. The third line uses dictionary comprehension to create a dictionary that can be used very much like the structured array which numpy.genfromtxt creates using the names=True option:

    print(data['row1'])    # array([   1.,   10.,  100.])
    print(data['row2'])    # array([   2.,   20.,  200.])
    print(data['row3'])    # array([   3.,   30.,  300.])
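The label-to-row dictionary can be tried without a file on disk. The following is a minimal sketch that feeds the same three rows through io.StringIO; the in-memory text and the explicit encoding="utf-8" are assumptions made for the example.

```python
import io
import numpy as np

# The three sample rows from the answer, held in memory instead of data.txt.
text = "row1,   1,  10, 100\nrow2,   2,  20, 200\nrow3,   3,  30, 300\n"

# Pass 1: only column 0, read as strings.
labels = np.genfromtxt(io.StringIO(text), delimiter=",", usecols=0,
                       dtype=str, encoding="utf-8")

# Pass 2: everything as float (the label column becomes NaN), then drop column 0.
raw_data = np.genfromtxt(io.StringIO(text), delimiter=",")[:, 1:]

# Pair each label with its numeric row.
data = {label: row for label, row in zip(labels, raw_data)}
```

A fresh StringIO is created for each pass because a file-like object is exhausted after the first read.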

User · Answer

For a dataset of this format (shown here in aligned columns):

    CONFIG000   1080.65 1080.87 1068.76 1083.52 1084.96 1080.31 1081.75 1079.98
    CONFIG001   414.6   421.76  418.93  415.53  415.23  416.12  420.54  415.42
    CONFIG010   1091.43 1079.2  1086.61 1086.58 1091.14 1080.58 1076.64 1083.67
    CONFIG011   391.31  392.96  391.24  392.21  391.94  392.18  391.96  391.66
    CONFIG100   1067.08 1062.1  1061.02 1068.24 1066.74 1052.38 1062.31 1064.28
    CONFIG101   371.63  378.36  370.36  371.74  370.67  376.24  378.15  371.56
    CONFIG110   1060.88 1072.13 1076.01 1069.52 1069.04 1068.72 1064.79 1066.66
    CONFIG111   350.08  350.69  352.1   350.19  352.28  353.46  351.83  350.94

This code works for my application:

    import numpy

    def ShowData(data, names):
        # Print each row label followed by that row's values on one line.
        for i in range(data.shape[0]):
            print(names[i] + ": " + " ".join(str(v) for v in data[i]))

    def Main():
        print("The sample data is: ")
        fname = "ANOVA.csv"
        # First pass: read everything as strings to get the labels and column count.
        csv = numpy.genfromtxt(fname, dtype=str, delimiter=",")
        num_rows = csv.shape[0]
        num_cols = csv.shape[1]
        names = csv[:, 0]
        # Second pass: read only the numeric columns as floats.
        data = numpy.genfromtxt(fname, usecols=range(1, num_cols), delimiter=",")
        print(names)
        print(str(num_rows) + "x" + str(num_cols))
        print(data)
        ShowData(data, names)

    Main()

Python-2 output:

    The sample data is:
    ['CONFIG000' 'CONFIG001' 'CONFIG010' 'CONFIG011' 'CONFIG100' 'CONFIG101'
     'CONFIG110' 'CONFIG111']
    8x9
    [[ 1080.65  1080.87  1068.76  1083.52  1084.96  1080.31  1081.75  1079.98]
     [  414.6    421.76   418.93   415.53   415.23   416.12   420.54   415.42]
     [ 1091.43  1079.2   1086.61  1086.58  1091.14  1080.58  1076.64  1083.67]
     [  391.31   392.96   391.24   392.21   391.94   392.18   391.96   391.66]
     [ 1067.08  1062.1   1061.02  1068.24  1066.74  1052.38  1062.31  1064.28]
     [  371.63   378.36   370.36   371.74   370.67   376.24   378.15   371.56]
     [ 1060.88  1072.13  1076.01  1069.52  1069.04  1068.72  1064.79  1066.66]
     [  350.08   350.69   352.1    350.19   352.28   353.46   351.83   350.94]]
    CONFIG000: 1080.65 1080.87 1068.76 1083.52 1084.96 1080.31 1081.75 1079.98
    CONFIG001: 414.6 421.76 418.93 415.53 415.23 416.12 420.54 415.42
    CONFIG010: 1091.43 1079.2 1086.61 1086.58 1091.14 1080.58 1076.64 1083.67
    CONFIG011: 391.31 392.96 391.24 392.21 391.94 392.18 391.96 391.66
    CONFIG100: 1067.08 1062.1 1061.02 1068.24 1066.74 1052.38 1062.31 1064.28
    CONFIG101: 371.63 378.36 370.36 371.74 370.67 376.24 378.15 371.56
    CONFIG110: 1060.88 1072.13 1076.01 1069.52 1069.04 1068.72 1064.79 1066.66
    CONFIG111: 350.08 350.69 352.1 350.19 352.28 353.46 351.83 350.94
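The "8x9" in that output counts the label column, so the numeric array itself has one column fewer. A shortened sketch of the same two-pass read makes this visible; the two rows are truncated to three values each, and the comma-separated layout and encoding="utf-8" are assumptions about the file, not something the answer states.

```python
import io
import numpy as np

# Two shortened rows in the style of the dataset (hypothetical, comma-separated).
text = (
    "CONFIG000,1080.65,1080.87,1068.76\n"
    "CONFIG001,414.6,421.76,418.93\n"
)

# Pass 1: read every cell as a string to get the labels and the total column count.
csv = np.genfromtxt(io.StringIO(text), dtype=str, delimiter=",", encoding="utf-8")
names = csv[:, 0]

# Pass 2: usecols skips column 0, so no NaN label column is ever created.
data = np.genfromtxt(io.StringIO(text), delimiter=",",
                     usecols=range(1, csv.shape[1]))

# Walk labels and numeric rows together.
for name, row in zip(names, data):
    print(name, row)
```

Compared with slicing off the first column after a full read, usecols avoids parsing the labels as floats in the first place.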

User · Answer

By default, np.genfromtxt uses dtype=float: that's why your string columns are converted to NaNs because, after all, they're Not A Number!

You can ask np.genfromtxt to try to guess the actual type of your columns by using dtype=None:

    >>> import numpy as np
    >>> from StringIO import StringIO  # Python 2; on Python 3 use: from io import StringIO
    >>> test = "a,1,2\nb,3,4"
    >>> a = np.genfromtxt(StringIO(test), delimiter=",", dtype=None)
    >>> a
    array([('a', 1, 2), ('b', 3, 4)],
          dtype=[('f0', '|S1'), ('f1', '<i8'), ('f2', '<i8')])

You can access the columns by using their name, like a['f0'].

Using dtype=None is a good trick if you don't know what your columns should be. If you already know what type they should have, you can give an explicit dtype. For example, in our test, we know that the first column is a string, the second an int, and we want the third to be a float. We would then use:

    >>> np.genfromtxt(StringIO(test), delimiter=",", dtype=("|S10", int, float))
    array([('a', 1, 2.0), ('b', 3, 4.0)],
          dtype=[('f0', '|S10'), ('f1', '<i8'), ('f2', '<f8')])

Using an explicit dtype is much more efficient than using dtype=None and is the recommended way.

In both cases (dtype=None or explicit, non-homogeneous dtype), you end up with a structured array.

Note: With dtype=None, the input is parsed a second time and the type of each column is updated to match the larger type possible: first we try a bool, then an int, then a float, then a complex, then we keep a string if all else fails. The implementation is rather clunky, actually. There had been some attempts to make the type guessing more efficient (using regexp), but nothing that stuck so far.
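A structured array keeps each column under its own field, so getting back to a plain two-dimensional float array takes one extra step. The sketch below is one way to do that on Python 3, assuming a comma-separated dtype string ("U10,i8,f8") and encoding="utf-8" in place of the byte-string dtype used in the session above.

```python
import io
import numpy as np

test = "a,1,2\nb,3,4"

# Explicit per-column types: unicode label, integer, float.
arr = np.genfromtxt(io.StringIO(test), delimiter=",",
                    dtype="U10,i8,f8", encoding="utf-8")

# Fields get the default names f0, f1, f2.
labels = arr["f0"]

# Stack the numeric fields side by side into an ordinary 2-D float array.
numeric = np.column_stack(
    [arr[name].astype(float) for name in arr.dtype.names[1:]]
)
```

After this, labels holds the strings and numeric has one row per input line, which is usually the shape wanted for further computation.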

User · Answer

    data = np.genfromtxt(csv_file, delimiter=',', dtype='unicode')

It works fine for me.
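Reading everything as text leaves the numbers as strings, so they still need converting before any arithmetic. A minimal sketch of that follow-up step, assuming dtype=str (the unicode string type on Python 3), an in-memory stand-in for csv_file, and an explicit encoding="utf-8":

```python
import io
import numpy as np

# Hypothetical stand-in for the CSV file.
csv_file = io.StringIO("row1,1,10,100\nrow2,2,20,200\n")

# Every cell is read as a string, so nothing turns into NaN.
data = np.genfromtxt(csv_file, delimiter=",", dtype=str, encoding="utf-8")

labels = data[:, 0]                 # first column stays as text
values = data[:, 1:].astype(float)  # remaining columns converted to floats
```

The conversion raises ValueError if any of the remaining cells is not a valid number, which can be a useful check on the input.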
