[apache-spark] How to show full column content in a Spark Dataframe?

I am using spark-csv to load data into a DataFrame. I want to do a simple query and display the content:

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("my.csv")
df.registerTempTable("tasks")
val results = sqlContext.sql("select col from tasks")
results.show()

The column content appears truncated:

scala> results.show();
+--------------------+
|                 col|
+--------------------+
|2015-11-16 07:15:...|
|2015-11-16 07:15:...|
|2015-11-16 07:15:...|
|2015-11-16 07:15:...|
|2015-11-16 07:15:...|
|2015-11-16 07:15:...|
|2015-11-16 07:15:...|
|2015-11-16 07:15:...|
|2015-11-16 07:15:...|
|2015-11-16 07:15:...|
|2015-11-16 07:15:...|
|2015-11-16 07:15:...|
|2015-11-16 07:15:...|
|2015-11-16 07:15:...|
|2015-11-16 07:15:...|
|2015-11-06 07:15:...|
|2015-11-16 07:15:...|
|2015-11-16 07:21:...|
|2015-11-16 07:21:...|
|2015-11-16 07:21:...|
+--------------------+

How do I show the full content of the column?

Tags: apache-spark, dataframe, spark-csv, output-formatting

Answers:


The code below will display all rows without truncating any column (PySpark):

df.show(df.count(), False)

results.show(false) will show you the full column content.

The show method limits output to 20 rows by default; adding a number before false shows more rows, as in the sketch below.
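A minimal sketch of the common show overloads in Scala (results stands in for any DataFrame):

    results.show()            // 20 rows, cell values truncated to 20 characters
    results.show(false)       // 20 rows, full column content
    results.show(100, false)  // up to 100 rows, full column content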


results.show(20, false) did the trick for me in Scala.


PYSPARK

In the code below, df is the name of the dataframe. The first parameter passes the full row count so that all rows are shown dynamically rather than hardcoding a number; the second parameter, False, displays the full column contents.

df.show(df.count(),False)



SCALA

The same approach in Scala, except that df.count() returns a Long, which must be cast to an Int before being passed to show; the second parameter, false, displays the full column contents.

df.show(df.count().toInt,false)



Try this command (PySpark, where count() already returns an int):

df.show(df.count())

If you use results.show(false), the results will not be truncated.


In C#, setting Option("truncate", false) keeps the output data from being truncated.

StreamingQuery query = spark
                    .Sql("SELECT * FROM Messages")
                    .WriteStream()
                    .OutputMode("append")
                    .Format("console")
                    .Option("truncate", false)
                    .Start();

I use a Chrome extension that widens the Jupyter notebook view; it works pretty well:

https://userstyles.org/styles/157357/jupyter-notebook-wide


The following answer applies to a Spark Streaming application.

By setting the "truncate" option to false, you can tell the console output sink to display the full column content.

import org.apache.spark.sql.streaming.{OutputMode, Trigger}

val query = out.writeStream
  .outputMode(OutputMode.Update())
  .format("console")
  .option("truncate", false)
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()

Within Databricks you can visualize the dataframe in a tabular format with the command:

display(results)

It renders the result as an interactive, scrollable table.


Tried this in PySpark; a truncate value of 0 (or any non-positive number) also disables truncation:

df.show(truncate=0)

results.show(20, False) in Python, or results.show(20, false) in Java/Scala, depending on which language you are using.


Try this in Scala:

df.show(df.count.toInt, false)

The show method accepts an integer and a Boolean, but df.count returns a Long, so a cast is required.


The other solutions are good. If these are your goals:

  1. No truncation of columns,
  2. No loss of rows,
  3. Fast and
  4. Efficient

These two lines are useful ...

    df.persist
    df.show(df.count.toInt, false) // in Scala; use df.count() and 'False' in Python

Persisting (or caching) keeps the interim dataframe in the executors, so the two actions, count and show, run faster and more efficiently; see the sketch below, and see the Spark documentation for more about persist and cache.
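A sketch of the full pattern in Scala, including releasing the cache afterwards (results stands in for any DataFrame):

    results.persist()                        // mark for caching; materialized by the first action
    results.show(results.count.toInt, false) // count materializes the cache, show reuses it
    results.unpersist()                      // free the cached data when done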

