Time Series Forecasting using Python Prophet

By:   |   Updated: 2024-04-22   |   Comments (2)   |   Related: > Python


Problem

I have a dataset representing a time series. There is one index column containing a timestamp and one target column containing a numerical target variable signifying my odometer or the number of kilometers I have driven in total over each period. Using this historical data, how can I predict the kilometers (km) I will accumulate over the following weeks? Knowing how many kilometers I will drive in the future will help me better plan how to service my vehicle.

Solution

We can use a Python library for time series forecasting to predict a target variable over a period. While there are many to choose from, we will focus on the Prophet library today.

Environment

Begin by creating a project folder and opening it using VS code. Then, create a requirements.txt file containing four lines: pandas, ipykernel, prophet, and statsmodels. Next, hit Ctrl+Shift+P and choose Python: Create Environment.

python create environment

Follow the prompts for creating a local virtual environment. Make sure to check requirements.txt so the environment agent will install the required Python packages directly:

python environment requirements

Finally, create a .ipynb file for the experiment. Make sure to select the kernel of the existing project environment:

create a jupyter notebook file

Dataset Overview

Let us start by checking out the data. I'll use a CSV file containing fueling data for the past three years. The file has two columns:

  • Date: date of fueling.
  • Odometer: total kilometers at the time of the fueling.

The difference between each consecutive row equals the total distance driven for that period. For example, over the fourth and fifth of March, the distance covered was 68 688 – 68 174 = 514 km. Here is the data preview:

import pandas as pd
 
df = pd.read_csv('fuelings_data.csv')
df
dataset preview

Transformations

Next, we need to transform these data a bit. We must convert the Date column from the default object type to pandas datetime. Then, we must rename the columns and set the timestamp column as an index column:

df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')
 
# prophet requires columns ds and y
df = df.rename(columns={'Date': 'ds', 'Odometer': 'y'})
# but we need an index for resampling
df = df.set_index('ds')
data transformations

Resampling

The next step is resampling. The intervals in our data set are not equal. We may have more than one fueling per day or fuelings with more than a week between them. To build a better model, it is necessary to resample the data into equal frequencies between the data points. Resampling per se and the intricacies behind it are out of the scope of this article. However, in this case, I already assessed that I get a better model with resampling rather than using the data directly. Resampling is straightforward with the powerful pandas resample function:

df_resampled = df.resample('W').mean().interpolate(method='linear')
pandas resample to weekly frequency

This code:

  • Instructs the dataframe to be resampled at equal weekly intervals, stipulated by the 'W'.
  • Calls the mean() function to get the mean over every interval in case there are multiple values per week.
  • Interpolates using linear interpolation to fill in the missing values when there is not enough data. We will estimate the missing values based on two neighboring temporal points. This is because some of the periods in the original dataset are not equally spaced between each other by a week but by some longer period.

Data Profile

Next, let's grab a quick data profile of the data so far.

df_resampled.describe()
data profile of the resampled dataframe

We see:

  • In red: the total amount of data points. Notice that, due to resampling, it has increased from 98 to 157.
  • In blue: the starting odometer reading.
  • In green: the end odometer reading. Notice it has changed from 68 688 to 68 431 km. This change is again due to the resampling and averaging over the existing interval.

One last point: let us reset the dataframe index. Resampling required a temporal index; however, the Prophet library expects only two columns named ds and y with no dataframe index.

df_resampled = df_resampled.reset_index()
pandas reset index

Prediction

Now to the meat and potatoes of this article: making a prediction. Here is the code:

01: from prophet import Prophet
02: 
03: m = Prophet()
04: m.fit(df_resampled)
05: 
06: future = m.make_future_dataframe(periods=12, freq='W')  # Forecasting for the next 12 weeks
07: forecast = m.predict(future)
08: 
09: fig = m.plot(forecast,
10:                  include_legend=True)

Let's break it down:

  • 01: Import the Prophet object for the model.
  • 03: Make an instance of the Prophet model.
  • 04: Fit the model to the data.
  • 06: Make a dataframe for the future periods, in this case, 12 weeks.
  • 07: Forecast the future for 12 weeks ahead using the model.
  • 09: Plot the forecast.

The result is the following line chart plotting the observed data points, the forecast, and the uncertainty interval:

timeseries forecast with FB prophet

The uncertainty interval seems narrow. Therefore, the model is good, or at least not too bad, considering we are plug-and-playing. However, we should not forget that the data is real-world and high-quality.

Cross Validation

To get a better understanding of the performance of our model, let us cross-validate it. The cross validation method will cut the data into chunks that we specified (the period parameter) and predict a future period (the horizon parameter) for every chunk:

from prophet.diagnostics import cross_validation
 
df_cv = cross_validation(m,
                         period= '28 days', # 4 weeks 
                         horizon = '28 days')

The output of the cross validation is a dataframe containing the true values (y) and the sampled forecast values yhat, y_lower, and y_upper:

prophet timeseries model cross validation

Having cross validated the model, we can proceed to computing statistics on the model performance:

from prophet.diagnostics import performance_metrics
 
df_p = performance_metrics(df_cv)
model performance metrics

The statistics computed are mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), mean absolute percent error (MAPE), median absolute percent error (MDAPE), and coverage of the yhat_lower and yhat_upper estimates. We are interested in MAE (unit is kilometers) and MAPE (unit is a percentage). We see that for a short forecast period (1-2 weeks), the error percentage is below 1; up to a month (4 weeks or 28 days), it is about 1.6%. These values are valid in the context of this model only. They give an accurate idea of what odometer readings to expect in the coming periods.

Plotting the Evaluation Metrics

Finally, we are ready to plot the evaluation results of the model to make sense of everything so far.

from prophet.plot import plot_cross_validation_metric
fig = plot_cross_validation_metric(df_cv, metric='mape', rolling_window=0.1)
plot of model performance metrics

As pointed out already, we see that for up to two weeks ahead, errors of up to 1% are common. For predictions further in the future, the error may approach 1.6–1.7%.

Conclusion

As seen in the prediction plot, two points of interest stand out. Sometime in the beginning of April, the odometer will reach the 70 000 km mark. Assuming the next oils service is due at 75,000 km, it is safe to assume that I should have the vehicle serviced in June unless a drastic change in driving habits occurs.

points of interest in the plot
Next Steps


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Hristo Hristov Hristo Hristov is a Data Scientist and Power Platform engineer with more than 12 years of experience. Between 2009 and 2016 he was a web engineering consultant working on projects for local and international clients. Since 2017, he has been working for Atlas Copco Airpower in Flanders, Belgium where he has tackled successfully multiple end-to-end digital transformation challenges. His focus is delivering advanced solutions in the analytics domain with predominantly Azure cloud technologies and Python. Hristo's real passion is predictive analytics and statistical analysis. He holds a masters degree in Data Science and multiple Microsoft certifications covering SQL Server, Power BI, Azure Data Factory and related technologies.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2024-04-22

Comments For This Article




Tuesday, April 23, 2024 - 3:15:03 PM - Hristo Hristov Back To Top (92196)
Hi Siraj,

Thanks for your comment. Here is the data. Just paste it to an empty excel file (I hope the tabs will be preserved). The date format is dd.mm.yyyy.

Date Odometer
05.03.2024 68688
04.03.2024 68174
02.03.2024 68078
01.03.2024 67549
12.02.2024 67088
27.01.2024 66635
27.01.2024 66208
21.01.2024 65837
20.01.2024 65309
06.01.2024 65017
06.01.2024 64455
17.12.2023 63941
16.12.2023 63418
25.11.2023 62938
25.11.2023 62471
18.11.2023 62205
29.10.2023 61686
18.10.2023 61248
17.10.2023 60704
14.10.2023 60213
13.10.2023 59673
12.10.2023 59064
06.10.2023 58792
29.09.2023 58219
15.09.2023 57759
02.09.2023 57147
02.09.2023 56574
01.09.2023 56030
29.08.2023 55516
28.08.2023 54927
24.08.2023 54473
19.08.2023 54166
02.08.2023 53783
01.08.2023 53239
30.07.2023 52691
29.07.2023 52152
27.07.2023 51552
22.07.2023 51480
25.06.2023 50898
18.06.2023 50424
01.06.2023 50056
24.05.2023 49436
23.05.2023 48928
22.05.2023 48344
21.05.2023 48161
18.05.2023 47554
18.05.2023 47124
17.05.2023 46616
17.05.2023 46037
10.05.2023 45773
11.04.2023 45338
21.03.2023 44942
24.02.2023 44392
21.01.2023 43849
10.12.2022 43481
30.11.2022 43169
31.10.2022 42398
29.10.2022 42332
22.10.2022 41713
28.09.2022 41105
14.08.2022 40373
08.08.2022 40095
24.07.2022 39530
09.07.2022 38893
29.06.2022 38530
04.06.2022 38218
27.05.2022 37731
30.04.2022 37182
15.04.2022 36773
26.03.2022 36345
12.03.2022 35828
19.02.2022 35459
04.02.2022 34993
29.11.2021 34269
21.11.2021 34006
13.11.2021 33502
31.10.2021 32604
28.09.2021 31844
10.09.2021 30979
09.09.2021 30468
09.09.2021 29948
08.09.2021 29587
07.09.2021 29203
03.09.2021 28753
20.08.2021 28195
17.08.2021 27705
13.08.2021 27277
08.08.2021 27055
07.08.2021 26541
07.08.2021 26056
31.07.2021 24935
31.07.2021 24869
27.06.2021 24327
23.06.2021 23830
05.06.2021 23555
13.05.2021 23058
05.04.2021 22554
14.03.2021 21993

Tuesday, April 23, 2024 - 5:02:57 AM - Siraj Sirajudeen Back To Top (92195)
Can you add the Source data link too?














get free sql tips
agree to terms