Time Series Forecasting using Python Prophet

Problem

I have a dataset representing a time series. It has one index column containing a timestamp and one target column containing a numerical value: my odometer reading, that is, the total number of kilometers I have driven up to that point. Using this historical data, how can I predict the kilometers (km) I will accumulate over the following weeks? Knowing how many kilometers I will drive in the future will help me better plan when to service my vehicle.

Solution

We can use a Python library for time series forecasting to predict a target variable over a period. While there are many to choose from, we will focus on the Prophet library today.

Environment

Begin by creating a project folder and opening it in VS Code. Then, create a requirements.txt file containing four lines, one package per line: pandas, ipykernel, prophet, and statsmodels. Next, press Ctrl+Shift+P and choose Python: Create Environment.

python create environment

Follow the prompts for creating a local virtual environment. Make sure to tick requirements.txt so that the required Python packages are installed into the new environment:

python environment requirements

Finally, create a .ipynb file for the experiment. Make sure to select the kernel of the existing project environment:

create a jupyter notebook file

Dataset Overview

Let us start by checking out the data. I’ll use a CSV file containing fueling data for the past three years. The file has two columns:

  • Date: date of fueling.
  • Odometer: total kilometers at the time of the fueling.

The difference between two consecutive rows equals the distance driven between those two fuelings. For example, between the fourth and the fifth of March, the distance covered was 68 688 – 68 174 = 514 km. Here is the data preview:

import pandas as pd

df = pd.read_csv('fuelings_data.csv')
df

dataset preview
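To make the row-to-row arithmetic concrete, here is a minimal sketch using the three most recent rows of the dataset. The Distance column name is my own addition for illustration and is not part of the file. The rows are sorted oldest-first before differencing, in case the file lists the newest fueling first, as the copy shared in the comments does.

```python
import pandas as pd

# Three rows copied from the dataset, newest first.
sample = pd.DataFrame({
    "Date": ["05.03.2024", "04.03.2024", "02.03.2024"],
    "Odometer": [68688, 68174, 68078],
})
sample["Date"] = pd.to_datetime(sample["Date"], format="%d.%m.%Y")

# Sort oldest-first so each difference is "this reading minus the previous one".
sample = sample.sort_values("Date").reset_index(drop=True)

# Distance (illustrative column name) is the km driven since the previous fueling.
sample["Distance"] = sample["Odometer"].diff()

print(sample)  # Distance column: NaN, 96.0, 514.0
```

The first row has no predecessor, so its distance is NaN; the last value reproduces the 514 km from the example above.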

Transformations

Next, we need to transform these data a bit. We must convert the Date column from the default object type to pandas datetime. Then, we must rename the columns and set the timestamp column as an index column:

df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')

# prophet requires columns ds and y
df = df.rename(columns={'Date': 'ds', 'Odometer': 'y'})
# but we need an index for resampling
df = df.set_index('ds')

data transformations
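The explicit format string matters because the dates are written day-first (dd.mm.yyyy): a string such as 05.03.2024 has to come out as 5 March, not 3 May. The small check below parses two sample strings; the variable names are illustrative only.

```python
import pandas as pd

# Two date strings in the dd.mm.yyyy layout used by the fueling file.
raw = pd.Series(["05.03.2024", "04.03.2024"])

# An explicit format removes any day/month ambiguity.
parsed = pd.to_datetime(raw, format="%d.%m.%Y")

print(parsed.dt.day.tolist())    # [5, 4]
print(parsed.dt.month.tolist())  # [3, 3]
```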

Resampling

The next step is resampling. The intervals in our data set are not equal. We may have more than one fueling per day or fuelings with more than a week between them. To build a better model, it helps to resample the data so that the data points are equally spaced in time. Resampling per se and the intricacies behind it are out of the scope of this article. However, in this case, I already assessed that I get a better model with resampling rather than using the data directly. Resampling is straightforward with the powerful pandas resample function:

df_resampled = df.resample('W').mean().interpolate(method='linear')

pandas resample to weekly frequency

This code:

  • Instructs the dataframe to be resampled at equal weekly intervals, stipulated by the ‘W’.
  • Calls the mean() function to get the mean over every interval in case there are multiple values per week.
  • Interpolates linearly to fill in the weeks that contain no data. Each missing value is estimated from the two neighboring weekly values. Such gaps appear wherever consecutive fuelings in the original dataset are more than a week apart.
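The three steps can be followed on a tiny example. The sketch below uses made-up readings (not taken from the fueling file): two readings fall into the same week, and the next reading arrives three weeks later, so two weekly slots start out empty.

```python
import pandas as pd

# Made-up readings: two in the first week, then nothing for two weeks.
toy = pd.DataFrame(
    {"y": [100.0, 200.0, 600.0]},
    index=pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-22"]),
)

# 'W' bins end on Sunday; weeks without any reading become NaN.
weekly_mean = toy.resample("W").mean()

# Linear interpolation fills the empty weeks on a straight line.
weekly = weekly_mean.interpolate(method="linear")

print(weekly_mean["y"].tolist())  # [150.0, nan, nan, 600.0]
print(weekly["y"].tolist())       # [150.0, 300.0, 450.0, 600.0]
```

The first value is the mean of the two readings from the same week; the two middle values did not exist in the input and come purely from interpolation.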

Data Profile

Next, let’s take a quick look at the summary statistics of the resampled data.

df_resampled.describe()

data profile of the resampled dataframe

We see:

  • In red: the total number of data points. Notice that, due to resampling, it has increased from 98 to 157.
  • In blue: the starting odometer reading.
  • In green: the end odometer reading. Notice it has changed from 68 688 to 68 431 km. This change is again due to resampling: the last week contains two readings (68 174 and 68 688 km), and their mean is 68 431 km.
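Both numbers can be cross-checked with a few lines of pandas. This is only a sanity check under the assumption that the first and last fuelings are those of 14.03.2021 and 05.03.2024, as in the data shared in the comments; the variable names are mine.

```python
import pandas as pd

# First and last fueling of the raw data (14.03.2021 and 05.03.2024).
span = pd.Series(
    [21993.0, 68688.0],
    index=pd.to_datetime(["2021-03-14", "2024-03-05"]),
)

# Number of weekly bins between the first and the last reading, empty ones included.
n_weeks = len(span.resample("W").mean())

# The last week holds the readings of 04.03.2024 and 05.03.2024.
last_week_mean = (68174 + 68688) / 2

print(n_weeks)         # 157
print(last_week_mean)  # 68431.0
```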

One last point: let us reset the dataframe index. Resampling required a temporal index; however, the Prophet library expects only two columns named ds and y with no dataframe index.

df_resampled = df_resampled.reset_index()

pandas reset index

Prediction

Now to the meat and potatoes of this article: making a prediction. Here is the code:

01: from prophet import Prophet
02: 
03: m = Prophet()
04: m.fit(df_resampled)
05: 
06: future = m.make_future_dataframe(periods=12, freq='W')  # Forecasting for the next 12 weeks
07: forecast = m.predict(future)
08: 
09: fig = m.plot(forecast,
10:                  include_legend=True)

Let’s break it down:

  • 01: Import the Prophet class.
  • 03: Make an instance of the Prophet model.
  • 04: Fit the model to the data.
  • 06: Make a dataframe for the future periods, in this case, 12 weeks.
  • 07: Forecast the future for 12 weeks ahead using the model.
  • 09-10: Plot the forecast, including a legend.

The result is the following line chart plotting the observed data points, the forecast, and the uncertainty interval:

timeseries forecast with FB prophet

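The uncertainty interval looks narrow, which suggests that the model follows the historical data closely, or at least not too badly, considering we are using Prophet with its default settings. A narrow interval alone does not prove that the forecast is accurate, though. We should also not forget that the data is real-world and of high quality.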

Cross Validation

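To get a better understanding of the performance of our model, let us cross-validate it. Prophet's cross_validation function picks a series of cutoff dates in the history, spaced by the period parameter. For every cutoff, it fits the model on the data up to that date and forecasts the following stretch, whose length is set by the horizon parameter. If the initial parameter is omitted, as it is here, the first training window defaults to three times the horizon: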

from prophet.diagnostics import cross_validation

df_cv = cross_validation(m,
                         period='28 days',   # spacing between cutoff dates (4 weeks)
                         horizon='28 days')  # how far ahead each forecast reaches
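To see how period and horizon interact, here is a simplified sketch of how the cutoff dates could be laid out. It is not Prophet's actual implementation: the function sketch_cutoffs is hypothetical, it assumes an initial training window of three times the horizon, and it assumes the weekly data spans 2021-03-14 to 2024-03-10.

```python
from datetime import date, timedelta

def sketch_cutoffs(first, last, horizon, period, initial=None):
    """Simplified illustration of cutoff selection (not Prophet's own code)."""
    if initial is None:
        initial = 3 * horizon  # assumed default training window
    cutoffs = []
    cutoff = last - horizon  # the latest cutoff leaves one full horizon to score
    while cutoff >= first + initial:
        cutoffs.append(cutoff)
        cutoff -= period  # step back by one period
    return sorted(cutoffs)

# Assumed span of the weekly data.
cutoffs = sketch_cutoffs(
    first=date(2021, 3, 14),
    last=date(2024, 3, 10),
    horizon=timedelta(days=28),
    period=timedelta(days=28),
)
print(len(cutoffs), cutoffs[0], cutoffs[-1])
```

Because period equals horizon here, the evaluated windows sit back to back without overlapping. A period shorter than the horizon would produce more, overlapping, forecasts.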

The output of the cross validation is a dataframe containing, for every cutoff, the true values (y) and the forecast values yhat, yhat_lower, and yhat_upper:

prophet timeseries model cross validation

Having cross validated the model, we can proceed to computing statistics on the model performance:

from prophet.diagnostics import performance_metrics

df_p = performance_metrics(df_cv)
df_p  # display the metrics table

model performance metrics

The statistics computed include mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), mean absolute percent error (MAPE), median absolute percent error (MDAPE), and coverage of the yhat_lower and yhat_upper estimates. We are interested in MAE (in kilometers) and MAPE (a relative error, read here as a percentage). We see that for a short forecast horizon (1-2 weeks), the error is below 1%; up to a month (4 weeks or 28 days), it is about 1.6%. These values are valid in the context of this model and this dataset only. They give a reasonable idea of how far off the forecast odometer readings may be in the coming periods.
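For readers who want to see what MAE and MAPE boil down to, here is a minimal hand computation on made-up numbers. It ignores the rolling-window aggregation over horizons that performance_metrics applies, and it expresses MAPE as a fraction, which, as far as I can tell, is also how Prophet reports it (0.016 would correspond to 1.6%).

```python
# Made-up true values (km) and forecasts, for illustration only.
y_true = [1000.0, 2000.0, 4000.0]
y_pred = [1010.0, 1980.0, 4040.0]

abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]

# MAE: average absolute error, in the unit of the target (km).
mae = sum(abs_errors) / len(abs_errors)

# MAPE: average absolute error relative to the true value, as a fraction.
mape = sum(e / abs(t) for e, t in zip(abs_errors, y_true)) / len(y_true)

print(f"MAE  = {mae:.2f} km")  # MAE  = 23.33 km
print(f"MAPE = {mape:.2%}")    # MAPE = 1.00%
```

MAE grows with the size of the readings, while MAPE stays comparable across scales, which is why both are worth a look for an ever-growing odometer.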

Plotting the Evaluation Metrics

Finally, we are ready to plot the evaluation results of the model to make sense of everything so far.

from prophet.plot import plot_cross_validation_metric

fig = plot_cross_validation_metric(df_cv, metric='mape', rolling_window=0.1)

plot of model performance metrics

As pointed out already, we see that for up to two weeks ahead, errors of up to 1% are common. For predictions further in the future, the error may approach 1.6–1.7%.

Conclusion

Two points of interest stand out in the prediction plot. Sometime at the beginning of April, the odometer will reach the 70 000 km mark. Assuming the next oil service is due at 75 000 km, I should plan to have the vehicle serviced in June, unless my driving habits change drastically.

points of interest in the plot
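Instead of reading such dates off the chart, they can also be looked up in the forecast dataframe. The sketch below uses a small hypothetical stand-in for the forecast (the yhat values are made up, and first_date_reaching is a helper name of my own); with the real forecast from above, the same lookup should work on its ds and yhat columns.

```python
import pandas as pd

# Hypothetical stand-in for Prophet's forecast dataframe (values are made up).
forecast = pd.DataFrame({
    "ds": pd.date_range("2024-03-17", periods=4, freq="W"),
    "yhat": [69200.0, 69700.0, 70150.0, 70600.0],
})

def first_date_reaching(forecast, km):
    """Return the first forecast date whose yhat is at least km, or None."""
    reached = forecast.loc[forecast["yhat"] >= km, "ds"]
    return reached.min() if not reached.empty else None

print(first_date_reaching(forecast, 70000))  # 2024-03-31 00:00:00
print(first_date_reaching(forecast, 75000))  # None
```

A None result simply means the mark lies beyond the forecast window, in which case a longer periods value in make_future_dataframe would be needed.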

2 Comments

  1. Hi Siraj,

    Thanks for your comment. Here is the data. Just paste it into an empty Excel file (I hope the tabs will be preserved). The date format is dd.mm.yyyy.

    Date	Odometer
    05.03.2024	68688
    04.03.2024	68174
    02.03.2024	68078
    01.03.2024	67549
    12.02.2024	67088
    27.01.2024	66635
    27.01.2024	66208
    21.01.2024	65837
    20.01.2024	65309
    06.01.2024	65017
    06.01.2024	64455
    17.12.2023	63941
    16.12.2023	63418
    25.11.2023	62938
    25.11.2023	62471
    18.11.2023	62205
    29.10.2023	61686
    18.10.2023	61248
    17.10.2023	60704
    14.10.2023	60213
    13.10.2023	59673
    12.10.2023	59064
    06.10.2023	58792
    29.09.2023	58219
    15.09.2023	57759
    02.09.2023	57147
    02.09.2023	56574
    01.09.2023	56030
    29.08.2023	55516
    28.08.2023	54927
    24.08.2023	54473
    19.08.2023	54166
    02.08.2023	53783
    01.08.2023	53239
    30.07.2023	52691
    29.07.2023	52152
    27.07.2023	51552
    22.07.2023	51480
    25.06.2023	50898
    18.06.2023	50424
    01.06.2023	50056
    24.05.2023	49436
    23.05.2023	48928
    22.05.2023	48344
    21.05.2023	48161
    18.05.2023	47554
    18.05.2023	47124
    17.05.2023	46616
    17.05.2023	46037
    10.05.2023	45773
    11.04.2023	45338
    21.03.2023	44942
    24.02.2023	44392
    21.01.2023	43849
    10.12.2022	43481
    30.11.2022	43169
    31.10.2022	42398
    29.10.2022	42332
    22.10.2022	41713
    28.09.2022	41105
    14.08.2022	40373
    08.08.2022	40095
    24.07.2022	39530
    09.07.2022	38893
    29.06.2022	38530
    04.06.2022	38218
    27.05.2022	37731
    30.04.2022	37182
    15.04.2022	36773
    26.03.2022	36345
    12.03.2022	35828
    19.02.2022	35459
    04.02.2022	34993
    29.11.2021	34269
    21.11.2021	34006
    13.11.2021	33502
    31.10.2021	32604
    28.09.2021	31844
    10.09.2021	30979
    09.09.2021	30468
    09.09.2021	29948
    08.09.2021	29587
    07.09.2021	29203
    03.09.2021	28753
    20.08.2021	28195
    17.08.2021	27705
    13.08.2021	27277
    08.08.2021	27055
    07.08.2021	26541
    07.08.2021	26056
    31.07.2021	24935
    31.07.2021	24869
    27.06.2021	24327
    23.06.2021	23830
    05.06.2021	23555
    13.05.2021	23058
    05.04.2021	22554
    14.03.2021	21993
