As the other answers have described, lit
and typedLit
are how to add constant columns to DataFrames. lit
is an important Spark function that you will use frequently, but not for adding constant columns to DataFrames.
You'll commonly be using lit
to create org.apache.spark.sql.Column
objects because that's the column type required by most of the org.apache.spark.sql.functions
.
Suppose you have a DataFrame with a some_date
DateType column and would like to add a column with the days between December 31, 2020 and some_date
.
Here's your DataFrame:
+----------+
| some_date|
+----------+
|2020-09-23|
|2020-01-05|
|2020-04-12|
+----------+
Here's how to calculate the days till the year end:
val diff = datediff(lit(Date.valueOf("2020-12-31")), col("some_date"))
df
.withColumn("days_till_yearend", diff)
.show()
+----------+-----------------+
| some_date|days_till_yearend|
+----------+-----------------+
|2020-09-23| 99|
|2020-01-05| 361|
|2020-04-12| 263|
+----------+-----------------+
You could also use lit
to create a year_end
column and compute the days_till_yearend
like so:
import java.sql.Date
df
.withColumn("yearend", lit(Date.valueOf("2020-12-31")))
.withColumn("days_till_yearend", datediff(col("yearend"), col("some_date")))
.show()
+----------+----------+-----------------+
| some_date| yearend|days_till_yearend|
+----------+----------+-----------------+
|2020-09-23|2020-12-31| 99|
|2020-01-05|2020-12-31| 361|
|2020-04-12|2020-12-31| 263|
+----------+----------+-----------------+
Most of the time, you don't need to use lit
to append a constant column to a DataFrame. You just need to use lit
to convert a Scala type to a org.apache.spark.sql.Column
object because that's what's required by the function.
See the datediff
function signature:
As you can see, datediff
requires two Column arguments.