PySpark: display a Spark data frame in a table format

Question

I am using PySpark to read a parquet file like below:

    my_df = sqlContext.read.parquet('hdfs://myPath/myDB.db/myTable/**')

Then when I do `my_df.take(5)`, it shows `[Row(...)]` instead of a table format like a pandas data frame.

Is it possible to display the data frame in a table format like a pandas data frame? Thanks!

User · Answer

As mentioned by @Brent in a comment on @maxymoo's answer, you can try `df.limit(10).toPandas()` to get a prettier table in Jupyter. But this can take some time to run if you are not caching the Spark dataframe. Also, `.limit()` will not keep the order of the original Spark dataframe.
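The two caveats above (run time without caching, and row order after `limit`) can be handled explicitly. The following is a minimal sketch, not a definitive recipe: `preview` and `demo` are hypothetical names, and `demo` assumes a local Spark installation with a sort column called `id`.

```python
def preview(df, n=10, order_by=None):
    """Return the first n rows of a Spark DataFrame as a pandas DataFrame.

    If order_by is given, sort first so the preview is deterministic;
    limit() on its own makes no promise about which rows come back.
    """
    if order_by is not None:
        df = df.orderBy(order_by)
    return df.limit(n).toPandas()


def demo():
    # Imported lazily so that preview() can be defined without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]").appName("preview-demo").getOrCreate()
    )
    df = spark.createDataFrame(
        [(3, "Joshua"), (1, "Mark"), (2, "Tom")], ["id", "firstName"]
    )
    # cache() is lazy: it marks the DataFrame for reuse, and the first action
    # populates the cache so later previews need not recompute the source.
    df.cache()
    print(preview(df, n=2, order_by="id"))
    spark.stop()
```

In Jupyter, leaving the result of `preview(df, 10, "id")` as the last expression in a cell renders it as an HTML table.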

User · Answer

Let's say we have the following Spark DataFrame:

    df = sqlContext.createDataFrame(
        [
            (1, "Mark", "Brown"),
            (2, "Tom", "Anderson"),
            (3, "Joshua", "Peterson")
        ],
        ('id', 'firstName', 'lastName')
    )

There are typically three different ways you can use to print the content of the dataframe:

Print Spark DataFrame

The most common way is to use the `show()` function:

    >>> df.show()
    +---+---------+--------+
    | id|firstName|lastName|
    +---+---------+--------+
    |  1|     Mark|   Brown|
    |  2|      Tom|Anderson|
    |  3|   Joshua|Peterson|
    +---+---------+--------+

Print Spark DataFrame vertically

Say that you have a fairly large number of columns and your dataframe doesn't fit on the screen. You can print the rows vertically. For example, the following command will print the top two rows, vertically, without any truncation:

    >>> df.show(n=2, truncate=False, vertical=True)
    -RECORD 0-------------
     id        | 1
     firstName | Mark
     lastName  | Brown
    -RECORD 1-------------
     id        | 2
     firstName | Tom
     lastName  | Anderson
    only showing top 2 rows

Convert to Pandas and print Pandas DataFrame

Alternatively, you can convert your Spark DataFrame into a Pandas DataFrame using `.toPandas()` and finally `print()` it:

    >>> df_pd = df.toPandas()
    >>> print(df_pd)
       id firstName  lastName
    0   1      Mark     Brown
    1   2       Tom  Anderson
    2   3    Joshua  Peterson

Note that this is not recommended when you have to deal with fairly large dataframes, as Pandas needs to load all the data into memory. If this is the case, the following configuration will help when converting a large Spark dataframe to a Pandas one:

    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

For more details you can refer to my blog post Speeding up the conversion between PySpark and Pandas DataFrames.
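To combine the Arrow setting with the memory warning above, one option is to cap the number of rows before converting. This is a sketch under assumptions: `to_pandas_capped` and its `max_rows` default are hypothetical, `spark` is taken to be an existing `SparkSession`, and Arrow-based conversion needs the `pyarrow` package installed alongside PySpark.

```python
def to_pandas_capped(spark, df, max_rows=1000):
    """Convert at most max_rows rows of a Spark DataFrame to pandas.

    spark is assumed to be an active SparkSession and df a Spark DataFrame.
    """
    # Ask Spark to use Arrow for the Spark-to-pandas transfer
    # (the configuration key quoted in the answer above).
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    # Cap the row count so only a bounded slice reaches driver memory.
    return df.limit(max_rows).toPandas()
```

A call such as `print(to_pandas_capped(spark, df, max_rows=20))` then prints a pandas-style table of at most 20 rows.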

User · Answer

Yes: call the `toPandas` method on your dataframe and you'll get an actual pandas dataframe!

User · Answer

The `show` method does what you're looking for.

For example, given the following dataframe of 3 rows, I can print just the first two rows like this:

    df = sqlContext.createDataFrame([('foo', 1), ('bar', 2), ('baz', 3)], ('k', 'v'))
    df.show(n=2)

which yields:

    +---+---+
    |  k|  v|
    +---+---+
    |foo|  1|
    |bar|  2|
    +---+---+
    only showing top 2 rows
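If the rows have already been collected with `take()`, as in the question, a small formatter can lay them out in a similar grid. The following is a pure-Python sketch that imitates the layout `show` prints; `format_rows` is a hypothetical helper, not part of PySpark, and it relies on `Row` objects behaving like tuples.

```python
def format_rows(rows, columns):
    """Render tuple-like rows as a grid resembling the output of show()."""
    cells = [[str(value) for value in row] for row in rows]
    # Each column is as wide as its longest entry, with a minimum of 3
    # characters to match the grids shown above.
    widths = [
        max([3, len(name)] + [len(row[i]) for row in cells])
        for i, name in enumerate(columns)
    ]
    separator = "+" + "+".join("-" * w for w in widths) + "+"
    header = "|" + "|".join(n.rjust(w) for n, w in zip(columns, widths)) + "|"
    body = [
        "|" + "|".join(v.rjust(w) for v, w in zip(row, widths)) + "|"
        for row in cells
    ]
    return "\n".join([separator, header, separator] + body + [separator])


# Plain tuples stand in for the Row objects that take() returns.
print(format_rows([("foo", 1), ("bar", 2)], ["k", "v"]))
```

With a real dataframe this could be called as `format_rows(df.take(2), df.columns)`.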
