Learn Spark SQL Date and Time Functions

By:   |   Updated: 2023-11-17   |   Comments   |   Related: > Apache Spark


Problem

The big data engineer can choose to use Spark dataframe methods or Spark SQL functions to transform their information. When working with clients, I lean towards using the SQL syntax since it is more widely used than dataframe methods. Every language has at least three core data types: strings, numbers, and date/time. How do we manipulate data/time values using Spark SQL?

Solution

In the next three articles, I will review the syntax for string, number, and date/time Spark SQL functions. This tip will focus on learning the available date/time functions.

Business Problem

Explore the syntax of the date/time functions available in Azure Databricks. When appropriate, we will use the Adventure Works data files that are exposed as hive tables in my star database. In the examples, I will execute Spark SQL using the magic command in a Python notebook. The code below lists the tables in the star database. This is how I test my SQL syntax before embedding it in a spark.sql call.

%sql
use star;
show tables;

The output shows the 20 tables available in the database. We will only use the fact table for internet sales.

Spark SQL - Date + Time Functions - list table in adventures schema

Current Date, Time, or Zone

The Spark ecosystem has evolved since its creation in 2012. The engine is written in Scala, but the library (PySpark) that we are using in the examples is written in Python. Therefore, it is not surprising that some functions return the same result but are named slightly differently.

%sql
--
-- 1 - Current date, time or zone
--
 
SELECT 
  curdate() as the_date1,
  current_date() as the_date2,
  current_timestamp() as the_timestamp1,
  now() as the_timestamp2,
  current_timezone() as the_timezone1, 
  from_utc_timestamp(current_timestamp(), 'America/Detroit') as the_est_time

The above code snippet returns the current date, timestamp, and time zone. The hardest part is to find the correct string to use for time zones. Please see the Wikipedia link for more information on time zones. The table below provides links to the functions that were explored in this simple query.

Example Function Description
1 curdate Return date object
2 current_date Return date object
3 current_timestamp Return timestamp object (UTC)
4 now Return timestamp object (UTC)
5 current_timezone Return a string representation of zone
6 from_utc_timestamp Time for zone without offset.

The output below shows the results of executing the Spark SQL query. In short, retrieving the current date, timestamp, or time zone is relatively easy.

Spark SQL - Date + Time Functions - current date, current timestamp and current time zone.

Date Categories

The Adventure Works company ships products all over the world. While the order date might be interesting, it might be more interesting to find the month that has the most orders. How do we convert a date into a numeric category for aggregation? The following Spark SQL query finds the year, quarter, month, week, and day for a given order date in the fact table for internet sales.

%sql
--
-- 2 - year, quarter, month, week and day functions
--
 
select 
  OrderDate,  
  year(OrderDate) as order_date_year,
  quarter(OrderDate) as order_date_qtr,
  month(OrderDate) as order_date_month,
  weekofyear(OrderDate) as order_date_week_no,
  day(OrderDate) as order_date_day
from 
  star.fact_internet_sales
limit 5;

Please see the table below that has links to the functions that were explored in this simple query.

Example Function Description
1 year Return date part as integer.
2 quarter Calculate quarter of year as an integer.
3 month Return date part as integer.
4 weekofyear Calculate week of year as an integer.
5 day Return date part as integer.

The output below shows the results of executing the Spark SQL query for testing date category functions. I used the limit statement to list just the first five rows.

Spark SQL - Date + Time Functions - Date Categories

Converting a date into a numeric category used for grouping is a very common task. I cannot count how many times I have written Spark SQL queries that group by a given date part.

Day Of Functions

The business line sometimes wants to know the day of the year, the day of the month, or the day of the week an order was placed. The Spark SQL language has two day of week functions; the only difference is how the enumeration is converted to a category. The dayofweek function is one-based and starts on Sunday, while the weekday function is zero-based and starts on Monday.

The following Spark SQL query uses the order date in the fact table for internet sales to explore these functions.

%sql
--
-- 3 - day of functions (year, month, week)
-- 
 
select 
  OrderDate,
  dayofyear(OrderDate) as order_date_doy,
  dayofmonth(OrderDate) as order_date_dom,
  dayofweek(OrderDate) as order_date_dow1,
  weekday(OrderDate) as order_date_dow2
from
  star.fact_internet_sales
limit 5;

The output below shows the results of executing the Spark SQL query for day of functions. I used the limit statement to list only the first five rows.

Spark SQL - Date + Time Functions - Day Of Functions

If we look at the second record in the list, the calendar date of 12-12-2012 occurred on a Wednesday. It is 19 days away from the end of the year. The total remaining days and day of the year equals 366 days, which makes sense since it was a leap year. The last two fields contain the day of the week, which is correct. The developer needs to translate this numeric into a string representing the day.

Please see the table below for links to the functions explored in this simple query.

Example Function Description
1 dayofyear Return day count as integer.
2 dayofmonth Return day count as integer.
3 dayofweek Return day count as integer.
4 weekday Return day count as integer.

I do not use the day of functions as much; however, they are available for your use in your Spark SQL queries.

String To Date Functions

Each computer system and/or language has different precision on date/time values. Therefore, it is not uncommon to receive a date or timestamp in a string format within a data file. It is up to the data engineer to convert the string to the correct data type.

The following Spark SQL query uses my sister's birthday as a string to test the two available functions. This is a dedication to my sister since she lost her battle with cancer in 2017. I truly miss her.

%sql
--
-- 4 - string to date functions
--
 
select 
  to_date('1978-09-06', 'yyyy-MM-dd') as to_date_fmt_ymd,
  to_date('06-09-1978', 'dd-MM-yyyy') as to_date_fmt_dmy,
  date('1978-09-06') as str_to_date

The output below shows the results of executing the Spark SQL query for the string to date functions.

Spark SQL - Date + Time Functions - String to Date functions

Please see the table below that has links to the string to date functions that were explored in this simple query.

Example Function Description
1 to_date Accepts many different formats.
2 date Returns date object

The string to date functions exist not only in Spark SQL but in most relational databases. It is not uncommon to see the use of these functions in everyday Spark SQL queries.

Time Categories

It might be interesting to look at all the Adventure Works orders placed by hour. That way, we can have staff to cover the sales call volume for a given hour. We can't get valid results for this query since the source system is keeping track of order dates, not order timestamps. Thus, all orders happen at midnight or at zero hours. If we wanted this information in our data lake, we would have to change the upstream system to include time.

The example below uses the current timestamp to show how to extract the hour, minute, and second.

%sql
--
-- 5 - parts of time as numerics
--
 
select 
  current_timestamp() as cur_time,
  hour(current_timestamp()) as cur_hour,
  minute(current_timestamp()) as cur_minute,
  second(current_timestamp()) as cur_second

The output below shows the results of executing the Spark SQL query for time part functions.

Spark SQL - Date + Time Functions - Time Categories

Please see the table below with links to the functions explored in this simple query.

Example Function Description
1 hour Return hours as an integer
2 minute Return minutes as an integer
3 second Return seconds as an integer

Converting a timestamp into a categorical numeric used for grouping is a very common task. Again, many of the functions that we talk about today have their origins in the Relational Database Management Systems (RDBMS).

Compose Date, Time, or Intervals

Some older systems, such as the mainframe, might store the date as three separate numeric values. How can we compose a date or timestamp from a set of numeric values? The example below shows how to use the two make functions to create a date and timestamp out of numeric values.

%sql
--
-- 6 - compose date, time or interval from numerics
-- 
select
  make_date(1978, 9, 6) as the_date,
  make_timestamp(1978, 9, 6, 7, 30, 15.887) as the_timestamp

The output below is from the above query.

Spark SQL - Date + Time Functions - Compose Date or Time

Composing a date or timestamp from numeric values is available in most relational database systems. What is not that common is the interval functions. What is an interval data type? It is a JSON representation of a given date/time.

The example below creates date, year/month, and regular interval data types.

%sql
select
  make_dt_interval(1978, 9, 6, 01.001001) as dt_interval,
  make_ym_interval(1978, 9) as ym_interval,
  make_interval(1978, 9, 6, 7, 30, 15, 0.887) as the_interval;

The screenshot below was taken from the output of the above query.

Spark SQL - Date + Time Functions - Make Date / Time Intervals

Please see the table below that has links to the functions that were explored in this simple query.

Example Function Description
1 make_date Return a date object
2 make_timestamp Return timestamp object
3 make_dt_interval Return date interval object
4 make_ym_interval Return year/month interval object
5 make_interval Return interval object

The make functions are handy when you have dates or times stored as numeric values.

Date Calculations

Various business problems might need to know the following: the number of months between two dates, the number of days between two dates, the number of years between two dates, what day of the week is a date, what is the next day, and what is the last day of the month. There are Spark SQL functions that can help you solve these problems.

The example query below explores these questions using my birthday and my sister's birthday.

%sql
--
-- 7 – date calculations
-- 
SELECT 
  abs(months_between('1968-09-05', '1978-09-06')) as month_diff_1,
  datediff('1978-09-06', '1968-09-05') / 30.437 as month_diff_2,
  date_diff('1978-09-06', '1968-09-05') / 365.25 as year_diff,
  date_format('1968-09-05', 'E') as birth_day,
  next_day('1968-09-05', 'TU') as first_tue,
  last_day('1968-09-05') as end_of_month

The screenshot below shows the output of the date calculations. We estimate the difference between months and years using an average number of days. Yes, we are off a couple of hundredths between the first two numbers. But usually, that is good enough for most use cases.

Spark SQL - Date + Time Functions - make date calculations

Please see the table below for links to the functions explored in this simple query.

Example Function Description
1 months_between Returns a double
2 datediff Returns an integer
3 date_diff Returns a big integer
4 date_format Returns a string
5 next_day Returns a date
6 last_day Returns a date

Calculating date differences is a great way to find orders that take the longest number of days to ship.

Date Arithmetic

There are three functions to perform addition and subtraction with dates: add months, date add, and date sub.

%sql
--
-- 8 - add or sub dates
--
  select
    add_months(date(now()), 1) as next_month1,
    date_add(curdate(), 1) as next_day1,
    date_sub(current_date(), 1) as prev_day1

The screenshot below shows the output of the date arithmetic.

Spark SQL - Date + Time Functions - Date Arthimetic

Please see the table below for links to the functions explored in this simple query.

Example Function Description
1 add_months Returns a date
2 date_add Returns a timestamp
3 date_sub Returns a timestamp

The Adventure Works company wants to follow up with each customer order one week after an item is shipped. We can easily calculate the callback date using the date add function.

Date Parts vs. Date Truncation

There might be a business requirement to return part of a date or truncate a date/timestamp to the nearest part. The query below explores these problems.

%sql
--
-- 9 - date parts and date truncate
-- 
%sql
  select
     date_part('YEAR', current_date()) as the_year1,
     datepart('YEAR', current_date()) as the_year2,
     current_timestamp() as the_date_time,
     date_trunc('YEAR', current_timestamp()) as the_year_start,
     date_trunc('MONTH', current_timestamp()) as the_month_start,
     date_trunc('DAY', current_timestamp()) as the_day_start,
     date_trunc('HOUR', current_timestamp()) as the_hour_start

The output from executing the above Spark SQL statement is seen below.

Spark SQL - Date + Time Functions - date parts + date truncate

Please see the table below for links to the functions explored in this simple query.

Example Function Description
1 date_part Returns a decimal or integer
2 datepart Returns a decimal or integer
3 date_trunc Returns a timestamp

The date part function is a more complex way to convert a timestamp to a numerical category. The date truncate function comes in handy when we do not care about certain parts of a date. For instance, return the current date and hour that an event occurred.

Unix Date/Time

The Unix operating system became available to companies in 1970. Therefore, it was natural for the system creators to count the number of seconds from January 1, 1970. In this section, we will talk about seven functions that you can use with these date/time values.

%sql
--
-- 10 – working with Unix time
-- 
select 
   unix_date(date('1968-09-05')) as unix_days1,
   unix_micros(timestamp('1968-09-05')) as unix_micros1,
   unix_millis(timestamp('1968-09-05')) as unix_millis1,
   unix_seconds(timestamp('1968-09-05')) as unix_secs1,
   unix_timestamp('1970-01-01', 'yyyy-MM-dd') as unix_ts1,
   date_from_unix_date(1) as unix_ts2,
   from_unixtime(0) as unix_ts3

The first four calculations are based on my birthday, 483 days before the Unix epoch. Therefore, these numbers are negative. Of course, asking for the number of seconds from the epoch is always zero. The last two functions convert from a numeric to a date or timestamp. The output from executing the above Spark SQL statement is seen below.

Spark SQL - Date + Time Functions - working with Unix timestamps

Please see the table below for links to the functions explored in this simple query.

Example Function Description
1 unix_date Returns an integer
2 unix_micros Returns a big integer
3 unix_millis Returns a big integer
4 unix_seconds Returns a big integer
5 unix_timestamp Returns a big integer
6 date_from_unix_date Returns a date
7 from_unixtime Returns a string

Here is some food for thought: Since a Unix timestamp is the number of seconds from an epoch, we can use arithmetic to get the number of seconds between two timestamps, then use the correct divisor to convert to days, weeks, months, or years.

Unix To Timestamp

What happens if we capture the number of seconds since January 1, 1970, in a data file and then want to know the current date from this field? There are three functions to convert seconds, milliseconds, and microseconds to a timestamp.

%sql
--
-- 11 – Unix time to timestamp
-- 
select 
  timestamp_micros(unix_micros(timestamp('1968-09-05 07:30:15'))) as ts1,
  timestamp_millis(unix_millis(timestamp('1968-09-05 07:45:30'))) as ts2,
  timestamp_seconds(unix_seconds(timestamp('1968-09-05 08:00:15'))) as ts3

The output from executing the above Spark SQL statement is seen below. I did change the time slightly on the timestamps to make sure the functions work.

Spark SQL - Date + Time Functions - Unix time to timestamp

Please see the table below for links to the functions explored in this simple query.

Example Function Description
1 timestamp_micros Returns a timestamp
2 timestamp_millis Returns a timestamp
3 timestamp_seconds Returns a timestamp

In short, we can convert from and to Unix timestamps stored in seconds, milliseconds, and microseconds.

Working With Intervals

The last way to perform arithmetic with date/time values is to use intervals. The keywords TIMESTAMP and INTERVAL are used in the example below. We are taking a string and converting it to a time stamp. Next, we add the interval to the given timestamp to arrive at our adjusted value as a timestamp.

%sql
--
-- 12 - interval examples
-- 
SELECT 
  TIMESTAMP '1968-09-05 07:30:15' + INTERVAL 3 YEARS AS birth_plus_3yy,
  TIMESTAMP '1968-09-05 07:30:15' + INTERVAL 3 MONTHS AS birth_plus_3mm,
  TIMESTAMP '1968-09-05 07:30:15' + INTERVAL 3 DAYS AS birth_plus_3dd,
  TIMESTAMP '1968-09-05 07:30:15' + INTERVAL 3 HOURS AS birth_plus_3hr,
  TIMESTAMP '1968-09-05 07:30:15' + INTERVAL 3 MINUTES AS birth_plus_3mi,
  TIMESTAMP '1968-09-05 07:30:15' + INTERVAL 3 SECONDS AS birth_plus_3ss

The image below shows my birthday adjusted by three years, three months, three days, three hours, three minutes, and three seconds.

Spark SQL - Date + Time Functions - date time calculations using intervals

The Spark SQL language supports the addition and subtraction of an interval from a timestamp. This is a great way to calculate future or past events. Please refer to the online documentation for more details.

Capstone Project

The previous sections were focused on the syntax of all the date/time functions supported by Spark SQL. Now, we have a capstone project to complete. How can we calculate the total shipment value by workday for January 2011? That way, we can ensure we have enough staff in the shipment department for high-demand days.

I am going to break this solution into two parts. We want to pivot the data by week for a given year and month. It is important to note that the PIVOT keyword is not supported with delta live tables. Therefore, it is important to know how to bucket data using straight SQL syntax.

In the first part of the solution, we will create a view with the data placed into the correct day bucket. I wanted to introduce the reader to both the case and if statements. I am using the case statement to convert a month represented as an integer to a string value. This can be done more efficiently with the date format function in one line of code. Please see the field called ShipMonthShort. The if statement is used to bucket the sales by a given workday (ship date) column.

--
-- 13 - Create view for ship data
--
 
%sql
create or replace view star.vw_ship_data as
  select 
    year(ShipDate) as ShipYear,
    month(ShipDate) as ShipOrder,
 
    date_format(ShipDate, 'MMM') as ShipMonthShort,    
    case 
        when month(ShipDate) == 1 then 'Jan'
        when month(ShipDate) == 2 then 'Feb'
        when month(ShipDate) == 3 then 'Mar'
        when month(ShipDate) == 4 then 'Apr'
        when month(ShipDate) == 5 then 'May'
        when month(ShipDate) == 6 then 'Jun'
        when month(ShipDate) == 7 then 'Jul'
        when month(ShipDate) == 8 then 'Aug'
        when month(ShipDate) == 9 then 'Sep'
        when month(ShipDate) == 10 then 'Oct'
        when month(ShipDate) == 11 then 'Nov'
        when month(ShipDate) == 12 then 'Dec'
        else 'unknown'
    end as ShipMonthLong,
 
    if(dayofweek(ShipDate) == 1, ExtendedAmount, 0) as SundaySale,
    if(dayofweek(ShipDate) == 2, ExtendedAmount, 0) as MondaySale,
    if(dayofweek(ShipDate) == 3, ExtendedAmount, 0) as TuesdaySale,
    if(dayofweek(ShipDate) == 4, ExtendedAmount, 0) as WednesdaySale,
    if(dayofweek(ShipDate) == 5, ExtendedAmount, 0) as ThursdaySale,
    if(dayofweek(ShipDate) == 6, ExtendedAmount, 0) as FridaySale,
    if(dayofweek(ShipDate) == 7, ExtendedAmount, 0) as SaturdaySale
 
  from 
    star.fact_internet_sales

The image below shows the output from executing a select statement against the newly created view.

Spark SQL - Date + Time Functions - using if statement to bucket sales

The second part of the solution is to aggregate and filter the data. The query below solves our capstone problem.

--
-- 14 - Create view for ship data
--
 
%sql
select 
  ShipYear,
  ShipMonthShort,
  Sum(SundaySale) as SundaySales,
  Sum(MondaySale) as MondaySales,
  Sum(TuesdaySale) as TuesdaySales,
  Sum(WednesdaySale) as WednesdaySales,
  Sum(ThursdaySale) as ThursdaySales,
  Sum(FridaySale) as FridaySales,
  Sum(SaturdaySale) as SaturdaySales
from 
  star.vw_ship_data
where
  ShipYear = 2011 and ShipMonthShort = 'Jan'
group by
  ShipYear,
  ShipOrder,
  ShipMonthShort
order by
  ShipYear,
  ShipOrder

The output below shows the unfiltered data sorted by year and month.

Spark SQL - Date + Time Functions - Aggregating sales by year, month and day of week.

The cliché of a "picture is worth a thousand words" can be used now. Let's create a bar graph and determine which days we need more staff.

Spark SQL - Date + Time Functions - displaying manually pivoted data using a bar graph

We can see that Sunday had the most products in terms of dollars being shipped from the Adventure Works warehouse. However, Wednesday came in as a close second. The least active day in the shipping department in January is Saturday.

Summary

Many functions deal with dates, timestamps, and time zones. A given occurrence (event) can be captured as a string, an internal date, or a timestamp object. We even learned that Unix systems have kept track of time as the number of seconds since January 1, 1970. Older systems like mainframes might have stored dates or times as separate integers for each category.

The functions in this article can be used by a big data developer to convert incoming file data into a modern format, such as date or timestamp, before storing it in a delta table. When creating summarized information for reporting, the developer might want to categorize the data before aggregating. Looking at every single order is very boring. However, the summarized sales value by shipping day interests the shipping manager. Now, he can staff appropriately for the busiest days of the week.

A good understanding of manipulating the basic data types (date/time, numbers, and strings) is a very important skill for the data engineer. Enclosed is the notebook used in this article. Next time, we will focus on working with numeric functions. Since many of these functions are algebraic in nature, we will be solving some math problems.

Next Steps
  • Learn about numeric functions.
  • Learn about string functions.


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author John Miner John Miner is a Data Architect at Insight Digital Innovation helping corporations solve their business needs with various data platform solutions.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2023-11-17

Comments For This Article