fbpx
Simplifying Time Series Forecasting: Replicating Monsaraida’s Solution on Kaggle for Retail Volume Predictions Simplifying Time Series Forecasting: Replicating Monsaraida’s Solution on Kaggle for Retail Volume Predictions
The M5 Competition, hosted on Kaggle, has recently drawn attention to the effectiveness of gradient-boosting methods for volume forecasts of retail... Simplifying Time Series Forecasting: Replicating Monsaraida’s Solution on Kaggle for Retail Volume Predictions

The M5 Competition, hosted on Kaggle, has recently drawn attention to the effectiveness of gradient-boosting methods for volume forecasts of retail products. In this article, we will focus on the accuracy track of the competition and tackle a time series problem. By replicating one of the top-ranking, yet simplest and most clear solutions proposed by Monsaraida (Masanori Miyahara), a Japanese computer scientist, we aim to equip our readers with code and ideas to successfully handle future forecasting competitions on Kaggle. While many solutions are available on the competition’s discussion pages, Monsaraida’s solution stands out for its simplicity and effectiveness, ranking fourth on the private leaderboard with a score of 0.53583. Notably, this solution utilizes general features without prior selection, such as sales statistics, calendars, prices, and identifiers.

This article is an excerpt from the book The Kaggle Workbook by Konrad Banachewicz and Luca Massaron, a practical workbook based on hands-on exercises that can help you start thinking like an experienced data scientist.

Computing predictions for specific dates and time horizons

The plan for replicating Monsaraida’s solution is to create a notebook customizable by input parameters to produce the necessary processed data for training and test datasets and the LightGBM models for predictions. The models, given data in the past, will be trained to learn to predict values in a specific number of days in the future. The best results can be obtained by having each model learn to predict the values in a specific week range in the future. Since we have to predict up to 28 days ahead, we need a model predicting from day +1 to day +7 in the future, then another one able to predict from day +8 to day +14, another from day +15 to +21, and finally, another one capable of handling predictions from day +22 to day +28. We will need a Kaggle notebook for each of these time ranges, thus we need four notebooks. Each of these notebooks will be trained to predict the future time span for each of the 10 stores that were part of the competition. In total, each notebook will produce ten models. Altogether, the notebooks will then produce 40 models covering all the future ranges and all the stores.

Since we need to predict both for the public leaderboard and for the private one, it is necessary to repeat this process twice, stopping training at day 1,913 (predicting days from 1,914 to 1,941) for the public test set submission, and at day 1,941 (predicting days from 1,942 to 1,969) for the private one.

Given the current limitations for running Kaggle notebooks based on CPU, all these eight notebooks can be run in parallel (the whole process takes almost 6 and a half hours). Each notebook can be distinguishable by others by its name, containing the parameters’ values relative to the last training day and the look-ahead horizon in days. An example of one of these notebooks can be found at https://www.kaggle.com/code/lucamassaron/m5-train-day-1941-horizon-7.

Let’s now examine together how the code has been arranged and what we can learn from Monsaraida’s solution.

We simply start by importing the necessary packages. You can just notice how, apart from NumPy and pandas, the only data science specialized package is LightGBM. You may also notice that we are going to use gc (garbage collection): that’s because we need to limit the amount of memory used by the script, and we frequently just collect and recycle the unused memory. As part of this strategy, we also frequently store models and data structures on disk, instead of keeping them in memory:

import numpy as np
import pandas as pd
import os
import random
import math
from decimal import Decimal as dec
import datetime
import time
import gc
import lightgbm as lgb
import pickle

import warnings
warnings.filterwarnings("ignore", category=UserWarning)

As part of the strategy to limit memory usage, we resort to the function to reduce the pandas DataFrame memory footprint, described in the Kaggle book and initially developed by Arjan Groen during the Zillow competition (read the discussion at https://www.kaggle.com/competitions/tabular-playground-series-dec-2021/discussion/291844):

def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32',
'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.
int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max <
np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max <
np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max <
np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float32).min and c_max <
np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}%
reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
   return df

We keep on defining functions for this solution, because splitting the solution into smaller parts helps and because it is easier to clean up all the used variables when you just return from a function (you keep only what you saved to disk). Our next function helps us to load all the data available and compress it:

def load_data():
    train_df = reduce_mem_usage(pd.read_csv("../input/m5-forecasting-
accuracy/sales_train_evaluation.csv"))
    prices_df = reduce_mem_usage(pd.read_csv("../input/m5-forecasting-
accuracy/sell_prices.csv"))
    calendar_df = reduce_mem_usage(pd.read_csv("../input/m5-forecasting-
accuracy/calendar.csv"))
    submission_df = reduce_mem_usage(pd.read_csv("../input/m5-forecastingaccuracy/
sample_submission.csv"))
    return train_df, prices_df, calendar_df, submission_df

Once the function has been defined, we run it:

train_df, prices_df, calendar_df, submission_df = load_data()

After preparing the code to retrieve the data relative to prices, volumes, and calendar information, we proceed to prepare the first processing function that will have the role to create a basic table of information having item_id, dept_id, cat_id, state_id, and store_id as row keys, a day column, and a values column containing the volumes. This is achieved starting from rows having all the days’ data columns by using the pandas command melt (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html).

The command takes as reference the index of the DataFrame and then picks all the remaining features, placing their name on a column and their value on another one (var_name and value_name parameters help you defi ne the name of these new columns). In this way, you can unfold a row representing the sales series of a certain item in a certain store into multiple rows, each one representing a single day. The fact that the positional order of the unfolded columns is preserved guarantees that now your time series spans on the vertical axis (you can therefore apply furthermore transformations on it, such as moving means).

To give you an idea of what is happening, here is the train_df before the transformation with pd.melt. Notice how the volumes of the distinct days are column features:

Figure 1: The training DataFrame

After the transformation, you obtain a grid_df, where the columns have been converted into rows, and days are now to be found under a new column:

Figure 2: Applying pd.melt to the training DataFrame

The feature d contains the reference to the columns that are not part of the index, in essence, all the features from d_1 to d_1935. This implies an increase of the number of rows in the dataset by 1,935 fold. By simply removing the d_ prefix from its values and converting them to integers, you now have a day feature.

Apart from this, the code snippet also separates a holdout of the rows. Such a holdout is your validation set. The validation strategy is based on reserving a part of the training data, based on time. On the training part, it will also add the rows necessary for your predictions based on the prediction horizon (the number of days you want to predict in the future) you provide.

Here is the function that creates our basic feature template. As input, it takes the train_df Data- Frame, the number of the day the training ends, and the prediction horizon:

def generate_base_grid(train_df, end_train_day_x, predict_horizon):
    index_columns = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id',
'state_id']

    grid_df = pd.melt(train_df, id_vars=index_columns, var_name='d',
value_name='sales')
    grid_df = reduce_mem_usage(grid_df, verbose=False)

    grid_df['d_org'] = grid_df['d']
    grid_df['d'] = grid_df['d'].apply(lambda x: x[2:]).astype(np.int16)

    time_mask = (grid_df['d'] > end_train_day_x) & (grid_df['d'] <= end_
train_day_x + predict_horizon)
    holdout_df = grid_df.loc[time_mask, ["id", "d", "sales"]].reset_
index(drop=True)
    holdout_df.to_feather(f"holdout_df_{end_train_day_x}_to_{end_train_
day_x + predict_horizon}.feather")
    del(holdout_df)
    gc.collect()

    grid_df = grid_df[grid_df['d'] <= end_train_day_x]
    grid_df['d'] = grid_df['d_org']
    grid_df = grid_df.drop('d_org', axis=1)

    add_grid = pd.DataFrame()
    for i in range(predict_horizon):
        temp_df = train_df[index_columns]
        temp_df = temp_df.drop_duplicates()
        temp_df['d'] = 'd_' + str(end_train_day_x + i + 1)
        temp_df['sales'] = np.nan
        add_grid = pd.concat([add_grid, temp_df])
grid_df = pd.concat([grid_df, add_grid])
grid_df = grid_df.reset_index(drop=True)

        for col in index_columns:
            grid_df[col] = grid_df[col].astype('category')

        grid_df = reduce_mem_usage(grid_df, verbose=False)
        grid_df.to_feather(f"grid_df_{end_train_day_x}_to_{end_train_day_x +
predict_horizon}.feather")
        del(grid_df)
        gc.collect()

After handling the function to create the basic feature template, we prepare a merge function for pandas DataFrames that will help to save memory space and avoid memory errors when handling large sets of data. Given two DataFrames, df1 and df2, and the set of foreign keys we need to be merged, the function applies a left outer join between df1 and df2 without creating a new merged object but simply expands the existent df1 DataFrame.

The function works first by extracting the foreign keys from df1, then merging the extracted keys with df2. In this way, the function creates a new DataFrame, called merged_gf, which is ordered as df1. At this point, we just assign the merged_gf columns to df1. Internally, df1 will pick the reference to the internal data structures from merged_gf. Such an approach helps minimize memory usage because only the necessary used data is created at any time (there are no duplicates that can fi ll up the memory). When the function returns df1, merged_gf is canceled but for the data now used by df1.

Here is the code for this utility function:

def merge_by_concat(df1, df2, merge_on):
    merged_gf = df1[merge_on]
    merged_gf = merged_gf.merge(df2, on=merge_on, how='left')
    new_columns = [col for col in list(merged_gf)
                   if col not in merge_on]
    df1[new_columns] = merged_gf[new_columns]
    return df1

After this necessary step, we proceed to program a new function to process the data. This time, we handle the prices data, a set of data containing the prices of each item by each store for all the weeks. Since it is important to figure out if we are talking about a new product appearing in a store or not, the function picks the first date of price availability (using the wm_yr_wk feature in the price table, representing the ID of the week) and it copies it to our feature template.

Here is the code for processing the release dates:

def calc_release_week(prices_df, end_train_day_x, predict_horizon):
    index_columns = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id',
'state_id']

    grid_df = pd.read_feather(f"grid_df_{end_train_day_x}_to_{end_train_
day_x + predict_horizon}.feather")

    release_df = prices_df.groupby(['store_id', 'item_id'])['wm_yr_wk'].
agg(['min']).reset_index()
    release_df.columns = ['store_id', 'item_id', 'release']

    grid_df = merge_by_concat(grid_df, release_df, ['store_id', 'item_
id'])

    del release_df
    grid_df = reduce_mem_usage(grid_df, verbose=False)
    gc.collect()

    grid_df = merge_by_concat(grid_df, calendar_df[['wm_yr_wk', 'd']],
['d'])

    grid_df = grid_df.reset_index(drop=True)
    grid_df['release'] = grid_df['release'] - grid_df['release'].min()
    grid_df['release'] = grid_df['release'].astype(np.int16)
  
    grid_df = reduce_mem_usage(grid_df, verbose=False)
    grid_df.to_feather(f"grid_df_{end_train_day_x}_to_{end_train_day_x +
predict_horizon}.feather")
    del(grid_df)
    gc.collect()

Through replicating Monsaraida’s top-ranking solution, we have showcased the effectiveness of using simple and straightforward methods to process release dates. However, to gain a deeper understanding of how this solution achieved an impressive private leaderboard score of 0.53583 by leveraging general features such as sales statistics, calendars, prices, and identifiers, we highly recommend referring to The Kaggle Workbook by Konrad Banachewicz and Luca Massaron. Both are Kaggle Grandmasters and highly acclaimed data scientists. We hope that this article has provided our readers with valuable insights and code to tackle similar forecasting competitions on Kaggle, and to inspire further exploration and experimentation in time series forecasting.

ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.

1