Take n rows from a Spark dataframe and pass to toPandas()

Question

I have this code:

    l = [('Alice', 1), ('Jim', 2), ('Sandra', 3)]
    df = sqlContext.createDataFrame(l, ['name', 'age'])
    df.withColumn('age2', df.age + 2).toPandas()

Works fine, does what it needs to. Suppose though I only want to display the first n rows, and then call toPandas() to return a pandas dataframe. How do I do it? I can't call take(n) because that doesn't return a dataframe and thus I can't pass it to toPandas().

So to put it another way, how can I take the top n rows from a dataframe and call toPandas() on the resulting dataframe? Can't think this is difficult but I can't figure it out.

I'm using Spark 1.6.0.

User · Answer

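You could get the first rows of the Spark DataFrame with head and then create a pandas DataFrame from them:

    import pandas as pd

    l = [('Alice', 1), ('Jim', 2), ('Sandra', 3)]
    df = sqlContext.createDataFrame(l, ['name', 'age'])

    # head(3) returns a list of Row objects collected on the driver
    df_pandas = pd.DataFrame(df.head(3), columns=df.columns)

    In [4]: df_pandas
    Out[4]:
         name  age
    0   Alice    1
    1     Jim    2
    2  Sandra    3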

User · Answer

You can use the limit(n) function:

    l = [('Alice', 1), ('Jim', 2), ('Sandra', 3)]
    df = sqlContext.createDataFrame(l, ['name', 'age'])
    df.limit(2).withColumn('age2', df.age + 2).toPandas()

Or:

    l = [('Alice', 1), ('Jim', 2), ('Sandra', 3)]
    df = sqlContext.createDataFrame(l, ['name', 'age'])
    df.withColumn('age2', df.age + 2).limit(2).toPandas()
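The limit approach can be wrapped in a small helper. This is a minimal sketch rather than part of the answer: `top_n_to_pandas` is a hypothetical name, and it assumes only that `df` exposes `limit(n)` and `toPandas()`, as Spark DataFrames do. Because `limit` returns another DataFrame, only the selected rows are brought to the driver when `toPandas()` runs.

```python
def top_n_to_pandas(df, n):
    """Return the first n rows of a Spark DataFrame as a pandas DataFrame.

    Assumes df has limit(n) and toPandas(), as pyspark.sql.DataFrame does.
    """
    if not isinstance(n, int) or n < 0:
        raise ValueError("n must be a non-negative integer")
    # limit() yields a new DataFrame, so toPandas() can be chained on it
    return df.limit(n).toPandas()

# Usage with the DataFrame from the question (needs a running Spark context):
#   top_n_to_pandas(df, 2)
```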

User · Answer

Try it:

    def showDf(df, count=None, percent=None, maxColumns=0):
        if df is None: return
        import pandas
        from IPython.display import display
        pandas.set_option('display.encoding', 'UTF-8')
        # Pandas dataframe
        dfp = None
        # maxColumns param
        if maxColumns >= 0:
            if maxColumns == 0: maxColumns = len(df.columns)
            pandas.set_option('display.max_columns', maxColumns)
        # count param
        if count is None and percent is None: count = 10  # Default count
        if count is not None:
            count = int(count)
            if count == 0: count = df.count()
            pandas.set_option('display.max_rows', count)
            dfp = pandas.DataFrame(df.head(count), columns=df.columns)
            display(dfp)
        # percent param
        elif percent is not None:
            percent = float(percent)
            if percent >= 0.0 and percent <= 1.0:
                import datetime
                now = datetime.datetime.now()
                # int() instead of long() so this also runs on Python 3
                seed = int(now.strftime("%H%M%S"))
                dfs = df.sample(False, percent, seed)
                count = df.count()
                pandas.set_option('display.max_rows', count)
                dfp = dfs.toPandas()
                display(dfp)

Examples of usage are:

    # Shows the first ten rows of the Spark dataframe
    showDf(df)
    showDf(df, 10)
    showDf(df, count=10)

    # Shows a random sample which represents 15% of the Spark dataframe
    showDf(df, percent=0.15)
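The percent branch seeds `df.sample` from the wall clock, so calls made in different seconds draw different samples. A minimal sketch of that seed derivation in isolation, written with `int` so it runs on Python 3 as well; `seed_from_time` is a hypothetical helper name, not part of the answer or of Spark:

```python
import datetime

def seed_from_time(now=None):
    """Build an integer seed from the HHMMSS part of a timestamp."""
    if now is None:
        now = datetime.datetime.now()
    # "%H%M%S" gives e.g. "090507"; int() drops the leading zero
    return int(now.strftime("%H%M%S"))

fixed = datetime.datetime(2020, 1, 1, 9, 5, 7)
print(seed_from_time(fixed))  # 90507
```

Passing a fixed integer as the seed instead should make the sample repeatable between calls on the same data.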
