I have Spark DataFrame with take(5) top rows as follows:
[Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)]
It's schema is defined as:
elevDF.printSchema()
root
|-- date: timestamp (nullable = true)
|-- hour: long (nullable = true)
|-- value: double (nullable = true)
How do I get the Year, Month, Day values from the 'date' field?
This question is related to
python
timestamp
apache-spark
pyspark
Actually, we really do not need to import any python library. We can separate the year, month, date using simple SQL. See the below example,
+----------+
| _c0|
+----------+
|1872-11-30|
|1873-03-08|
|1874-03-07|
|1875-03-06|
|1876-03-04|
|1876-03-25|
|1877-03-03|
|1877-03-05|
|1878-03-02|
|1878-03-23|
|1879-01-18|
I have a date column in my data frame which contains the date, month and year and assume I want to extract only the year from the column.
df.createOrReplaceTempView("res")
sqlDF = spark.sql("SELECT EXTRACT(year from `_c0`) FROM res ")
Here I'm creating a temporary view and store the year values using this single line and the output will be,
+-----------------------+
|year(CAST(_c0 AS DATE))|
+-----------------------+
| 1872|
| 1873|
| 1874|
| 1875|
| 1876|
| 1876|
| 1877|
| 1877|
| 1878|
| 1878|
| 1879|
| 1879|
| 1879|
You can use functions in pyspark.sql.functions
: functions like year
, month
, etc
refer to here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
from pyspark.sql.functions import *
newdf = elevDF.select(year(elevDF.date).alias('dt_year'), month(elevDF.date).alias('dt_month'), dayofmonth(elevDF.date).alias('dt_day'), dayofyear(elevDF.date).alias('dt_dayofy'), hour(elevDF.date).alias('dt_hour'), minute(elevDF.date).alias('dt_min'), weekofyear(elevDF.date).alias('dt_week_no'), unix_timestamp(elevDF.date).alias('dt_int'))
newdf.show()
+-------+--------+------+---------+-------+------+----------+----------+
|dt_year|dt_month|dt_day|dt_dayofy|dt_hour|dt_min|dt_week_no| dt_int|
+-------+--------+------+---------+-------+------+----------+----------+
| 2015| 9| 6| 249| 0| 0| 36|1441497601|
| 2015| 9| 6| 249| 0| 0| 36|1441497601|
| 2015| 9| 6| 249| 0| 0| 36|1441497603|
| 2015| 9| 6| 249| 0| 1| 36|1441497694|
| 2015| 9| 6| 249| 0| 20| 36|1441498808|
| 2015| 9| 6| 249| 0| 20| 36|1441498811|
| 2015| 9| 6| 249| 0| 20| 36|1441498815|
Source: Stackoverflow.com