I work on a dataframe with two columns, mvv and count.
+---+-----+
|mvv|count|
+---+-----+
|  1|    5|
|  2|    9|
|  3|    3|
|  4|    1|
+---+-----+
I would like to obtain two lists containing the mvv values and the count values, something like
mvv = [1,2,3,4]
count = [5,9,3,1]
So I tried the following code. The first line should return a Python list of rows; I wanted to see the first value:
mvv_list = mvv_count_df.select('mvv').collect()
firstvalue = mvv_list[0].getInt(0)
But I get an error message with the second line:
AttributeError: getInt
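For reference, getInt belongs to the Scala/Java Row API; a PySpark Row has no such method. As a minimal sketch (reusing the question's mvv_count_df), a collected Row supports positional, key, and attribute access instead:
mvv_list = mvv_count_df.select('mvv').collect()

# PySpark Rows are accessed by index, column name, or attribute,
# not with getInt():
firstvalue = mvv_list[0][0]       # positional index
firstvalue = mvv_list[0]['mvv']   # by column name
firstvalue = mvv_list[0].mvv      # attribute access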
Let's create the dataframe in question
df_test = spark.createDataFrame(
    [
        (1, 5),
        (2, 9),
        (3, 3),
        (4, 1),
    ],
    ['mvv', 'count']
)
df_test.show()
Which gives
+---+-----+
|mvv|count|
+---+-----+
| 1| 5|
| 2| 9|
| 3| 3|
| 4| 1|
+---+-----+
and then apply rdd.flatMap(list).collect() to get the values as a flat list:
test_list = df_test.select("mvv").rdd.flatMap(list).collect()
print(type(test_list))
print(test_list)
which gives
<type 'list'>
[1, 2, 3, 4]
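The same pattern applied to the count column yields the second list from the question (a straightforward variant of the snippet above):
count_list = df_test.select("count").rdd.flatMap(list).collect()
print(count_list)
# [5, 9, 3, 1]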
If you get the error below:
AttributeError: 'list' object has no attribute 'collect'
this code will solve your issue:
mvv_list = mvv_count_df.select('mvv').collect()
mvv_array = [int(i.mvv) for i in mvv_list]
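Note that this attribute-access style does not carry over to the count column: PySpark's Row is a tuple subclass, so row.count resolves to the built-in tuple.count method rather than the column value. A sketch of the safe alternative, indexing the Row by column name:
count_rows = mvv_count_df.select('count').collect()

# row.count would return tuple's count() method, not the value,
# so index by column name (or position) instead:
count_array = [int(row['count']) for row in count_rows]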
The following one-liner gives the list you want:
mvv = mvv_count_df.select("mvv").rdd.flatMap(lambda x: x).collect()
I ran a benchmarking analysis and list(mvv_count_df.select('mvv').toPandas()['mvv'])
is the fastest method. I'm very surprised.
I ran the different approaches on 100 thousand / 100 million row datasets using a 5 node i3.xlarge cluster (each node has 30.5 GB of RAM and 4 cores) with Spark 2.4.5. Data was evenly distributed across 20 snappy-compressed Parquet files with a single column.
Here's the benchmarking results (runtimes in seconds):
+--------------------------------------------------------------+---------+-------------+
| Code                                                         | 100,000 | 100,000,000 |
+--------------------------------------------------------------+---------+-------------+
| df.select("col_name").rdd.flatMap(lambda x: x).collect()     | 0.4     | 55.3        |
| list(df.select('col_name').toPandas()['col_name'])           | 0.4     | 17.5        |
| df.select('col_name').rdd.map(lambda row: row[0]).collect()  | 0.9     | 69          |
| [row[0] for row in df.select('col_name').collect()]          | 1.0     | OOM         |
| [r[0] for r in mid_df.select('col_name').toLocalIterator()]  | 1.2     | *           |
+--------------------------------------------------------------+---------+-------------+
* cancelled after 800 seconds
A golden rule to follow when collecting data on the driver node: toPandas was significantly improved in Spark 2.3, so it's probably not the best approach if you're using a Spark version earlier than 2.3. See here for more details / benchmarking results.
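Much of that improvement comes from Apache Arrow. As a hedged sketch, the Arrow-based toPandas path can be enabled explicitly; the flag below is the Spark 2.3/2.4 name (Spark 3.x renamed it to spark.sql.execution.arrow.pyspark.enabled):
# Enable Arrow-backed columnar transfer for toPandas (Spark 2.3/2.4 flag name)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

col_list = list(df.select("col_name").toPandas()["col_name"])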
The following code will help you:
mvv_count_df.select('mvv').rdd.map(lambda row : row[0]).collect()
This will give you all the elements as a list.
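If you need both lists, a variant of the same idea (not from the original answer) collects the (mvv, count) pairs in one pass and unzips them:
# One collect() for both columns, then transpose the pairs
pairs = mvv_count_df.select('mvv', 'count').rdd.map(lambda row: (row[0], row[1])).collect()
mvv, count = map(list, zip(*pairs))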
mvv_list = list(
    mvv_count_df.select('mvv').toPandas()['mvv']
)
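Since toPandas pulls whole columns to the driver anyway, both lists can come from a single conversion; a sketch along the same lines:
pdf = mvv_count_df.select('mvv', 'count').toPandas()
mvv = list(pdf['mvv'])
count = list(pdf['count'])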
On my data I got these benchmarks:
>>> data.select(col).rdd.flatMap(lambda x: x).collect()
0.52 sec
>>> [row[col] for row in data.collect()]
0.271 sec
>>> list(data.select(col).toPandas()[col])
0.427 sec
The result is the same in all three cases.
A possible solution is using the collect_list() function from pyspark.sql.functions. This will aggregate all column values into a PySpark array that is converted into a Python list when collected:
from pyspark.sql.functions import collect_list

mvv_list = df.select(collect_list("mvv")).collect()[0][0]
count_list = df.select(collect_list("count")).collect()[0][0]
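Both aggregates can also be computed in a single job rather than two; collect()[0] is just the one result Row:
row = df.select(collect_list("mvv"), collect_list("count")).collect()[0]
mvv_list, count_list = row[0], row[1]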