Spark DataFrame TimestampType - how to get Year, Month, Day values from field?

Question

I have a Spark DataFrame whose top five rows, from take(5), are as follows:

    [Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),
     Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),
     Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),
     Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),
     Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)]

Its schema is defined as:

    elevDF.printSchema()

    root
     |-- date: timestamp (nullable = true)
     |-- hour: long (nullable = true)
     |-- value: double (nullable = true)

How do I get the Year, Month, Day values from the 'date' field?

User · Accepted Answer

Since Spark 1.5 you can use a number of date processing functions (dayofweek was added later, in Spark 2.3):

    pyspark.sql.functions.year
    pyspark.sql.functions.month
    pyspark.sql.functions.dayofmonth
    pyspark.sql.functions.dayofweek
    pyspark.sql.functions.dayofyear
    pyspark.sql.functions.weekofyear

For example:

    import datetime
    from pyspark.sql.functions import year, month, dayofmonth

    elevDF = sc.parallelize([
        (datetime.datetime(1984, 1, 1, 0, 0), 1, 638.55),
        (datetime.datetime(1984, 1, 1, 0, 0), 2, 638.55),
        (datetime.datetime(1984, 1, 1, 0, 0), 3, 638.55),
        (datetime.datetime(1984, 1, 1, 0, 0), 4, 638.55),
        (datetime.datetime(1984, 1, 1, 0, 0), 5, 638.55)
    ]).toDF(["date", "hour", "value"])

    elevDF.select(
        year("date").alias("year"),
        month("date").alias("month"),
        dayofmonth("date").alias("day")
    ).show()
    # +----+-----+---+
    # |year|month|day|
    # +----+-----+---+
    # |1984|    1|  1|
    # |1984|    1|  1|
    # |1984|    1|  1|
    # |1984|    1|  1|
    # |1984|    1|  1|
    # +----+-----+---+

You can also use a simple map, as with any other RDD:

    from pyspark.sql import Row

    elevDF = sqlContext.createDataFrame(sc.parallelize([
        Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),
        Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),
        Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),
        Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),
        Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)]))

    # Go through .rdd (DataFrame.map was removed in Spark 2.0) and read the
    # field by name; tuple-unpacking lambdas are not valid in Python 3.
    (elevDF
     .rdd
     .map(lambda row: (row.date.year, row.date.month, row.date.day))
     .collect())

and the result is:

    [(1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1)]

By the way, datetime.datetime stores an hour anyway, so keeping it separately seems to be a waste of memory.
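The map-based approach only depends on each row exposing its date as a datetime.datetime, so the per-row logic can be tried without a cluster. Below is a minimal sketch under that assumption: the namedtuple is a stand-in for pyspark.sql.Row, and the helper name date_parts is hypothetical, not part of any Spark API.

```python
import datetime
from collections import namedtuple

# Stand-in for pyspark.sql.Row so the sketch runs without Spark (assumption).
Row = namedtuple("Row", ["date", "hour", "value"])


def date_parts(row):
    """Return (year, month, day) taken from the row's `date` field."""
    d = row.date
    return (d.year, d.month, d.day)


# Same five rows as in the question.
rows = [Row(datetime.datetime(1984, 1, 1, 0, 0), h, 638.55) for h in range(1, 6)]
parts = [date_parts(r) for r in rows]
print(parts)

# With a real DataFrame, the same function could be passed to the RDD API:
#   elevDF.rdd.map(date_parts).collect()
```

A named function like this is easier to unit test than an inline lambda, though for large data the built-in column functions are usually the better choice because they avoid moving rows through Python.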

User · Answer

Actually, we really do not need to import any Python library. We can separate the year, month, and day using simple SQL. See the example below:

    +----------+
    |       _c0|
    +----------+
    |1872-11-30|
    |1873-03-08|
    |1874-03-07|
    |1875-03-06|
    |1876-03-04|
    |1876-03-25|
    |1877-03-03|
    |1877-03-05|
    |1878-03-02|
    |1878-03-23|
    |1879-01-18|

I have a date column in my data frame which contains the day, month, and year, and assume I want to extract only the year from the column:

    df.createOrReplaceTempView("res")
    sqlDF = spark.sql("SELECT EXTRACT(year from `_c0`) FROM res")

Here I'm creating a temporary view and storing the year values using this single line. The output will be:

    +-----------------------+
    |year(CAST(_c0 AS DATE))|
    +-----------------------+
    |                   1872|
    |                   1873|
    |                   1874|
    |                   1875|
    |                   1876|
    |                   1876|
    |                   1877|
    |                   1877|
    |                   1878|
    |                   1878|
    |                   1879|
    |                   1879|
    |                   1879|
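The question asks for month and day as well as year, and the same EXTRACT pattern can be repeated for each field. Here is a rough sketch that only builds the query string, so it runs without a Spark session. The helper name build_extract_query is hypothetical, and it assumes a Spark version whose SQL dialect supports EXTRACT (Spark 3.0 or later, as far as I know) and the view and column names from the answer above.

```python
def build_extract_query(view, column, fields=("year", "month", "day")):
    """Build a Spark SQL query that extracts each field from `column` in `view`."""
    select_list = ", ".join(
        # Backticks quote the column and the alias so neither clashes with a keyword.
        "EXTRACT({f} FROM `{c}`) AS `{f}`".format(f=f, c=column)
        for f in fields
    )
    return "SELECT {s} FROM {v}".format(s=select_list, v=view)


query = build_extract_query("res", "_c0")
print(query)

# With a live session and the temporary view registered as in the answer:
#   sqlDF = spark.sql(query)
```

Aliasing each expression also gives the result columns short names instead of the generated year(CAST(_c0 AS DATE)) header.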

User · Answer

You can use the functions in pyspark.sql.functions, such as year, month, etc.

Refer to: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame

    from pyspark.sql.functions import *

    newdf = elevDF.select(
        year(elevDF.date).alias('dt_year'),
        month(elevDF.date).alias('dt_month'),
        dayofmonth(elevDF.date).alias('dt_day'),
        dayofyear(elevDF.date).alias('dt_dayofy'),
        hour(elevDF.date).alias('dt_hour'),
        minute(elevDF.date).alias('dt_min'),
        weekofyear(elevDF.date).alias('dt_week_no'),
        unix_timestamp(elevDF.date).alias('dt_int'))

    newdf.show()

    +-------+--------+------+---------+-------+------+----------+----------+
    |dt_year|dt_month|dt_day|dt_dayofy|dt_hour|dt_min|dt_week_no|    dt_int|
    +-------+--------+------+---------+-------+------+----------+----------+
    |   2015|       9|     6|      249|      0|     0|        36|1441497601|
    |   2015|       9|     6|      249|      0|     0|        36|1441497601|
    |   2015|       9|     6|      249|      0|     0|        36|1441497603|
    |   2015|       9|     6|      249|      0|     1|        36|1441497694|
    |   2015|       9|     6|      249|      0|    20|        36|1441498808|
    |   2015|       9|     6|      249|      0|    20|        36|1441498811|
    |   2015|       9|     6|      249|      0|    20|        36|1441498815|
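As a sanity check on what each column means, the first output row above can be reproduced with the Python standard library alone. This is only a sketch: it assumes the Spark session time zone was UTC (hour, minute, and unix_timestamp depend on it) and relies on weekofyear following ISO 8601 week numbering, which is how the Spark documentation describes it.

```python
from datetime import datetime, timezone

epoch_seconds = 1441497601  # dt_int from the first output row above
ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

parts = {
    "dt_year": ts.year,
    "dt_month": ts.month,
    "dt_day": ts.day,
    "dt_dayofy": ts.timetuple().tm_yday,   # day of year, 1-based
    "dt_hour": ts.hour,
    "dt_min": ts.minute,
    "dt_week_no": ts.isocalendar()[1],     # ISO 8601 week number
    "dt_int": int(ts.timestamp()),         # back to epoch seconds
}
print(parts)
```

If the session ran in a different time zone, the hour and possibly the day would shift, so treat this as a cross-check for these sample values rather than a general equivalence.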
