Cohort-based models are an alternative to time series models when it comes to forecasting of paid subscriptions
TLDR on Cohort-Based Models
A company offering subscriptions (e.g. Wix, Spotify, Dropbox, Grammarly) can forecast its future paid subscriptions using time series models, like ARIMA or Prophet. These models are trained on time series data containing subscriptions by dates.
An interesting alternative is to reformat the data to have the subscriptions by users’ registration dates and purchase dates, basically transforming the time series data into tabular data. This makes it possible to apply regression models, like GLM or GBM, which often produce better forecasts and also offer additional insights regarding the attribution of future subscriptions to cohorts of users. These models are called cohort-based models.
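To make the reformatting concrete, here is a minimal sketch in pandas. The `purchases` table and its column names are hypothetical, made up for illustration. The sketch only shows how the same records can be viewed either as a time series or as a cohort table.

```python
import pandas as pd

# Hypothetical user-level records: one row per user who bought a premium.
purchases = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "registration_date": pd.to_datetime(
        ["2019-01-01", "2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02"]
    ),
    "upgrade_date": pd.to_datetime(
        ["2019-01-01", "2019-01-03", "2019-01-03", "2019-01-02", "2019-01-05"]
    ),
})

# Time series view: premiums counted by upgrade date only.
time_series = purchases.groupby("upgrade_date").size().rename("premiums")

# Cohort (tabular) view: premiums counted by registration date AND upgrade date.
cohort_table = (
    purchases.groupby(["registration_date", "upgrade_date"])
    .size()
    .rename("premiums")
    .reset_index()
)
```

Each row of `cohort_table` is one observation for a regression model, and summing it over registration dates gives back the time series.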
What is a cohort?
By dictionary definition, a cohort is a “group of people with a shared characteristic, usually age”. In our case with cohort-based models, the users registered on a given date represent a cohort. For example, the “cohort of 2019-01-01” consists of all users registered on 2019-01-01. Likewise, the “cohort of 2019” includes all users registered during 2019.
A few more definitions related to cohort-based models before we go further:
- Registration date: the date when the user has registered;
- Upgrade date (Purchase date): the date when the user purchased a premium subscription;
- Age of a cohort/user: days since the registration date;
- Premiums: paid subscriptions. A user first registers then buys a subscription, sometimes on the same day, other times only after using the product at no cost for a while. Many companies like Wix, Spotify, Dropbox have a “freemium” business model or offer free trial periods for their product.
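As a small illustration of these definitions, the age of a cohort on a given upgrade date can be computed as the difference between the two dates. The table below is hypothetical and the column names are assumptions.

```python
import pandas as pd

# Hypothetical cohort table: premiums by registration date and upgrade date.
cohort_table = pd.DataFrame({
    "registration_date": pd.to_datetime(["2019-01-01", "2019-01-01", "2019-01-02"]),
    "upgrade_date": pd.to_datetime(["2019-01-01", "2019-01-15", "2019-02-01"]),
    "premiums": [12, 3, 1],
})

# Age = days elapsed between registration and purchase (0 means same-day purchase).
cohort_table["age"] = (
    cohort_table["upgrade_date"] - cohort_table["registration_date"]
).dt.days
```

Indexing cohorts by age rather than by calendar date is what lets cohorts with different registration dates be compared on the same axis.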
The figure below illustrates the number of premiums generated by one hypothetical cohort registered on 2019-01-01, for the first 30 days after the registration.
Figure 1. Premiums by upgrade date & age for the cohort of the registration date of 2019-01-01. (Image by author)
Cohorts behave (almost) similarly
Usually, cohorts behave similarly. When plotting the premiums of multiple cohorts of different registration dates by upgrade date, we can observe that they have similar shapes.
The similarity is more evident when we plot the same cohorts by age.
Once again, if we plot the same cohorts by age, but now for 365 days after registration, we can observe some long tails, meaning that cohorts generate premiums long after registration.
Figure 4. Cohorts of different registration dates by age — first 365 days. (Image by author)
There are a few important characteristics to observe in the figures above:
- More premiums are generated in the first days after registration;
- The rate at which users purchase premiums declines roughly exponentially as they “get older”;
- A substantial number of subscriptions are being purchased a long time after users registered — “long tails”.
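The decay by age can be sketched with a simple log-linear fit. This is an illustration on synthetic, noise-free data, not the models used in the article: a pure exponential is an assumed simplification and would understate the long tails mentioned above.

```python
import numpy as np

# Synthetic premiums for one cohort over its first 30 days (assumed values).
age = np.arange(0, 30)
premiums = 100.0 * np.exp(-0.1 * age)

# For an exponential curve, log(premiums) is linear in age,
# so a degree-1 polynomial fit recovers the decay rate.
slope, intercept = np.polyfit(age, np.log(premiums), deg=1)
decay_rate = -slope                # close to 0.1 per day
initial_level = np.exp(intercept)  # close to 100 premiums at age 0
```

On real cohorts the fitted line would typically bend away from the data at higher ages, which is one way to see the long tail.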
Forecasting premiums by cohorts
Let’s pretend that today is 2020-01-01 and we want to forecast new premiums for the 90 days ahead. Ultimately, the premiums will come from existing cohorts (users registered up to and including today) and future cohorts (users who will register during the quarter starting tomorrow).
First, let’s take all existing cohorts registered during the last 365 days and call them recent.
The threshold of 365 days is chosen arbitrarily here. Depending on how long the tail is, some companies may treat cohorts registered during the last 90 days as recent, while for others the window can be 2 years. The idea is to split existing cohorts into recent and old and apply a different prediction model to each. We will return to this later.
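A minimal sketch of this split is shown below. The function name `cohort_group` and the exact boundary convention (an age strictly below the threshold counts as recent) are assumptions made for the example.

```python
import pandas as pd

TODAY = pd.Timestamp("2020-01-01")
RECENT_THRESHOLD_DAYS = 365  # arbitrary choice, as discussed in the text


def cohort_group(registration_date, today=TODAY, threshold=RECENT_THRESHOLD_DAYS):
    """Label a cohort as 'future', 'recent' or 'old' relative to today."""
    if registration_date > today:
        return "future"
    age_days = (today - registration_date).days
    # Boundary convention (assumption): age strictly below threshold is recent.
    return "recent" if age_days < threshold else "old"


registration_dates = pd.to_datetime(
    ["2015-06-01", "2019-12-15", "2020-01-01", "2020-01-02"]
)
groups = [cohort_group(d) for d in registration_dates]
```

Changing `RECENT_THRESHOLD_DAYS` to 90 or 730 gives the other splits mentioned above.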
When we plot the recent cohorts, they may look like the figure below. Our task is to forecast beyond today’s date, marked in red. The lines to the right of the red mark are unknown to us. That’s what we want to predict.
Figure 5. Recent cohorts. (Image by author)
Forecasting premiums produced by recent cohorts means extrapolating the “tails” of these cohorts into the future. For example, as you can see in the figure below, for an existing cohort registered on 2019-12-15, we have to come up with the red dotted line as our guess of the true grey line, which is unknown to us but can hopefully be learned from older cohorts. The existing recent cohorts are also the easiest to forecast because we know more about them: most importantly, we know their size and their early dynamics.
We’ll also have to come up with an estimate for future cohorts, born after today. These are marked in blue in the figure below. We know little about future cohorts, except perhaps that a new cohort will appear every day during the next 90 days. The cohort born tomorrow will have 90 days to generate premiums, while the cohort registered on the last day of the forecast period will have only one day to generate premiums. Hopefully, these cohorts share the same features as past cohorts, and drawing the blue shapes is an exercise closer to data science than it is to painting.
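The shape of the prediction task for future cohorts can be sketched as a grid of (registration date, upgrade date) pairs. The variable names are assumptions. The grid only enumerates which pairs need a prediction and does not predict anything.

```python
import pandas as pd

today = pd.Timestamp("2020-01-01")
horizon_days = 90

# Forecast period: the 90 days starting tomorrow.
forecast_dates = pd.date_range(
    today + pd.Timedelta(days=1), periods=horizon_days, freq="D"
)

# A future cohort can only generate premiums on or after its registration date.
rows = [
    (reg, upg)
    for reg in forecast_dates
    for upg in forecast_dates
    if upg >= reg
]
future_grid = pd.DataFrame(rows, columns=["registration_date", "upgrade_date"])
future_grid["age"] = (
    future_grid["upgrade_date"] - future_grid["registration_date"]
).dt.days

# Number of days each future cohort has to generate premiums within the horizon.
days_available = future_grid.groupby("registration_date").size()
```

The first cohort gets 90 rows and the last one a single row, which is the triangular shape of the blue area described above.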
Figure 7. Recent and future cohorts. (Image by author)
Unless our product (or company) is younger than one year (the threshold we selected for recent), we will inevitably have old cohorts too. These are existing cohorts, registered before the recent ones and long before the forecast period starts. Let’s add them to the plot and mark them in orange. They appear as a multitude of overlapping lines slightly above zero. There can be many of them. For example, if the history of the product/company starts in 2010, there will be about 3,285 cohort lines (9 years * 365 registration dates). Although each old cohort generates only a few premiums, together they can account for a substantial portion of future revenue.
Figure 8. Old, recent, and future cohorts. (Image by author)
Let’s now aggregate all cohorts by upgrade date and plot their totals. These are some nice-looking time series. A few observations to make:
- Premiums by recent cohorts are dropping with time (remember the exponential decay by age).
- Future cohorts account for an increasing share of the total future premiums.
- Old cohorts may account for a substantial part of the total.
Figure 9. Total premiums by old, recent, and future cohorts. (Image by author)
Let’s go one step further and sum up all three parts: old, recent, and future. This will get us the time series of total premiums, our target. That is the black bold line in the figure below.
Figure 10. Total of totals of premiums by old, recent, and future cohorts. (Image by author)
The technique described above is exactly what we do to forecast premiums. We break down cohorts into old, recent, and future, and for each part we apply a separate regression model. We do that because the three parts differ in terms of distribution and available features. Each of these models predicts premiums for many cohorts (registration dates), and for each cohort it predicts premiums for many upgrade dates in the future. We then aggregate the predictions of each model by upgrade date to obtain a time series for each part: old, recent, and future. In the end, we sum up all three parts to obtain the total premiums by upgrade date. That is exactly what a time series model would give us: premiums by future dates. The cohort-based approach gets there in a more complicated way, but it has advantages in forecast accuracy and in the additional insights it provides about users.
In conclusion, we have shown how the cohort-based approach transforms a time-series forecasting task into a regression task.
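The aggregation step can be sketched as follows. The predictions are made-up numbers standing in for the output of the three regression models, and the column names are assumptions.

```python
import pandas as pd

d1, d2 = pd.Timestamp("2020-01-02"), pd.Timestamp("2020-01-03")

# Hypothetical per-cohort predictions from the three models (values are made up).
predictions = pd.DataFrame(
    [
        ("old", "2015-03-01", d1, 0.2),
        ("old", "2015-03-01", d2, 0.2),
        ("old", "2016-07-10", d1, 0.3),
        ("old", "2016-07-10", d2, 0.3),
        ("recent", "2019-12-15", d1, 4.0),
        ("recent", "2019-12-15", d2, 3.5),
        ("recent", "2019-12-20", d1, 6.0),
        ("recent", "2019-12-20", d2, 5.5),
        ("future", "2020-01-02", d1, 8.0),
        ("future", "2020-01-02", d2, 5.0),
        ("future", "2020-01-03", d2, 9.0),
    ],
    columns=["part", "registration_date", "upgrade_date", "predicted_premiums"],
)

# Step 1: aggregate each part by upgrade date (one time series per part).
by_part = (
    predictions.groupby(["upgrade_date", "part"])["predicted_premiums"]
    .sum()
    .unstack("part", fill_value=0.0)
)

# Step 2: sum the three parts to get total premiums by upgrade date.
total = by_part.sum(axis=1)
```

Keeping `by_part` around, rather than only `total`, is what preserves the attribution of future premiums to old, recent, and future cohorts.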
Time-series properties of the target to forecast
The time series of totals has some interesting properties that we need to model:
- Seasonality (weekly, yearly)
- Holidays (e.g. Christmas, Independence Day, Easter)
- Sales spikes during special events (e.g. Black Friday, Cyber Monday)
Figure 11. Time-series properties: seasonality, holidays. (Image by author)
And of course, when we zoom out a bit, we may discover that there is also a trend that needs to be included in our models (see figure below).
Figure 12. Time-series properties: trend. (Image by author)
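One common way to expose these properties to a regression model is through calendar features derived from the upgrade date. The sketch below is an assumption about how this could be done, not the implementation behind the figures. The `special_days` list is hand-written for illustration.

```python
import pandas as pd

upgrade_dates = pd.date_range("2019-11-25", "2019-12-31", freq="D")

# Hand-written calendar of special days (illustrative, not exhaustive).
special_days = {
    pd.Timestamp("2019-11-29"): "black_friday",
    pd.Timestamp("2019-12-02"): "cyber_monday",
    pd.Timestamp("2019-12-25"): "christmas",
}

features = pd.DataFrame({"upgrade_date": upgrade_dates})
# Weekly seasonality: Monday=0 ... Sunday=6.
features["day_of_week"] = features["upgrade_date"].dt.dayofweek
# Yearly seasonality: position of the date within the year.
features["day_of_year"] = features["upgrade_date"].dt.dayofyear
# Holidays and sales events as a categorical feature.
features["special_day"] = features["upgrade_date"].map(special_days).fillna("none")
# Trend: days elapsed since the first date in the data.
features["trend"] = (
    features["upgrade_date"] - features["upgrade_date"].min()
).dt.days
```

These columns can be joined to the cohort table on `upgrade_date`, so that each (registration date, upgrade date) row carries the calendar context of the purchase day.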
This text gives a general understanding of the concept of cohort-based models in the context of premiums forecasting. In the next sections of this report at ODSC, I will talk about implementation. I will provide some important recommendations and technical details on how to build these regression models well and which key points to pay attention to.
About the author/ODSC West 2021 speaker on Cohort-Based Models
Nicolai Vicol is a Data Scientist at Wix, where he specializes in forecasting new users, paid subscriptions, cash flows and generally everything related to time series. He started his career as a quant in an investment bank, then switched to data science and IT, accumulating 9 years of experience in the field in total. Areas of interest: time series and forecasting, but also recommendation systems, search systems and operations research.