Forecasting Amazon Holiday Sales: Gathering, Organizing & Exploration
Blog posted by Edward Kwartler, November 24, 2016
When I worked at Amazon more than three years ago, the holiday season was always an adrenaline rush. My role was in customer service operations. We had record-breaking volume each holiday season, which meant customer service was the front line for customer issues, technical support and even catching fraud. Amazon’s retail business, like all retail businesses in the US, is extremely cyclical. Working in operations and workforce planning means a lot of diligence is put into forecasting the Q4 spike.
To get a historical sense of the customer service spike (correlated with revenue), you can track the velocity on Amazon’s public customer service forums, which show a 10x spike in Q4 each year. This makes working in an Amazon operational role one of the most lovingly frustrating and rewarding professional challenges I have ever been a part of. You have to hire many thousands of seasonal customer service agents worldwide, train them, plan the holiday staffing season to minimize mandatory overtime (which affects morale and cost), likely create new processes for some new device or service, and then wind it all down in an orderly fashion in Q1. When I was there, each holiday season Amazon launched a new Kindle, the Android app store or a video on demand service, which meant agents needed to be cross-trained accordingly. The agents’ skills have to shift from answering “where’s my stuff?” before the holidays to “how does this Kindle Fire I got as a gift work?” The intense period starting with Black Friday and Cyber Monday is one part chaos, one part beautiful orchestral music and two parts hard work so as not to fail our customers.
Since customer service is really a supply and demand problem, a lot of deliberation was put into Q4 planning. Given some unknown launch of “x” plus organic growth, the organization could expect a forecasted amount of emails, phone calls and chats each half hour across 25 queues globally. These contacts represent the demand curve on the ecosystem. Operational capacity planning analysis then forecasts the needed supply of agents to fill each of these contact channels worldwide.
In this post I show two methods for understanding a time series. Since we can’t truly know Amazon’s contact volume from outside the organization, I will use time series decomposition and Holt-Winters forecasting with Amazon’s quarterly revenue. Our goal will be to understand the revenue cycles and forecast this holiday season’s revenue, which kicks off on Black Friday!
Gathering the Data
Surprisingly, finding Amazon’s quarterly revenue in a free online service was painful. In the end I scraped it from www.wikinvest.com using rvest.
In the code below I load the packages to extract and organize the quarterly revenue. rvest is used to harvest web pages. pbapply is a simple progress bar version of the base apply functions. Next, qdap has one of my favorite functions, mgsub, for multiple global substitutions. The last two, ggthemes and ggplot2, are used for creating nice visualizations. I also set a global option in R so character strings are not read in as categorical factors.
library(rvest)     # web scraping
library(pbapply)   # apply functions with progress bars
library(qdap)      # mgsub for multiple global substitutions
library(ggthemes)  # pre-built ggplot2 themes
library(ggplot2)   # plotting
options(stringsAsFactors = F)  # keep strings as characters, not factors
I found the information I was looking for on Amazon’s annual wikinvest page, shown below. The screenshot shows the table of quarterly revenue for 2015 that I want to extract and organize into our time series.
Reviewing wikinvest’s URL, I noticed that the final parameter represents the year. Thus, I need to create a URL for Amazon for each year from 1996 to 2015 to extract each page’s information. Using paste0 and seq, the object all.years becomes a text vector of the URLs I intend to web scrape. paste0 concatenates strings without separation. seq creates an integer sequence between two numbers, here 1996, then 1997 and so on up to 2015. Since R recycles when needed, the base URL starting with “http…” repeats for each number in the sequence.
all.years<-paste0('http://www.wikinvest.com/stock/Amazon.com_(AMZN)/Data/Revenue/',seq(1996,2015))
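As a quick sanity check, the first elements of all.years should be the 1996 and 1997 URLs:
head(all.years, 2)
# [1] "http://www.wikinvest.com/stock/Amazon.com_(AMZN)/Data/Revenue/1996"
# [2] "http://www.wikinvest.com/stock/Amazon.com_(AMZN)/Data/Revenue/1997"
length(all.years)  # 20 URLs, one per year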
After some trial and error I found that the exact table containing the revenue numbers is called “nv_related_nums.” Further, cells 2 through 5 of that table hold the revenue numbers themselves. So I wrote a custom function called rev.get that reads a web page. Once the page is read, the node containing the exact table is selected and, more specifically, the exact cells within it are extracted. Within each node > table > cell in question, just the html_text is extracted. The four revenue numbers from cells 2, 3, 4 and 5 are then organized into a data frame called revenues.
To understand the xpath within the function, break it down as follows: //*[@id="nv_related_nums"] finds the element anywhere on the page whose id attribute is “nv_related_nums” (the revenue table), and /td[2] then steps to that element’s second table cell, which holds the Q1 value; td[3] through td[5] hold Q2 through Q4.
rev.get <- function(x){
  x <- read_html(x)        # read the web page
  revenues <- data.frame(  # pull cells 2-5 of the revenue table
    q1 = html_text(html_node(x, xpath = '//*[@id="nv_related_nums"]/td[2]')),
    q2 = html_text(html_node(x, xpath = '//*[@id="nv_related_nums"]/td[3]')),
    q3 = html_text(html_node(x, xpath = '//*[@id="nv_related_nums"]/td[4]')),
    q4 = html_text(html_node(x, xpath = '//*[@id="nv_related_nums"]/td[5]'))
  )
  return(revenues)
}
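If you want to see the xpath logic in isolation, here is a minimal sketch using an inline HTML snippet; the markup and dollar figures below are invented to mimic wikinvest’s table, not copied from it:
library(rvest)
# Hypothetical markup mimicking the revenue row; the values are made up
demo.page <- read_html('<table><tr id="nv_related_nums">
  <td>Revenue</td><td>$27.9 million</td><td>$31.0 million</td>
  <td>$37.0 million</td><td>$60.0 million</td></tr></table>')
# The id match locates the row; td[2] is the Q1 cell
html_text(html_node(demo.page, xpath = '//*[@id="nv_related_nums"]/td[2]'))
# [1] "$27.9 million"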
Armed with this function I can now run my pre-constructed URLs through it. The pblapply function accepts the all.years URL vector and applies rev.get to each element. I simply unlist the resulting list to get a simple text vector containing all the information from the table cells for Amazon’s quarterly revenue.
all.rev<-pblapply(all.years,rev.get)
all.rev<-unlist(all.rev)
To examine what we’ve captured, just use the head function on all.rev. The resulting information is shown in the following table.
head(all.rev)
The first six web-scraped Amazon revenue values.
Notice how each value is actually a text string, and that the web page lists NA for some of the earliest quarters? We will need to clean that up, but first let’s organize the scraped data into a data frame. The code below creates an ID column numbered from 1 to the total length of the scraped data (all.rev) and another column containing all the scraped values.
all.df <- data.frame(ID = seq_along(all.rev), string_revenue = all.rev)
Organizing the Data
When scraping data from the web, a fair amount of data manipulation is often needed. Here we have text when we want numbers, there are NA values, and, since Amazon’s revenue started small and has grown into the billions, values are listed in a mix of “million” and “billion.” The next step is to clean it all up for analysis.
First I create an object called patterns. This is a small text vector for “ billion,” “ million,” and “$.” Next another object called replacements is created. replacements is an empty character vector, which means it doesn’t contain anything, not even a space between the quotes. By the way, I like to use single quotes when I code, but you can use double quotes “” for an empty character too.
patterns<-c(' billion',' million','$')
replacements<-c('')
I then pass patterns, replacements and the column of scraped revenue data, all.df$string_revenue, into mgsub. The function will search for any of the string patterns and substitute each with the empty character. The last function parameter is the vector where the substitutions will occur. The code below adds a new column to the data frame called revenue, leaving string_revenue intact.
all.df$revenue<-mgsub(patterns,replacements,all.df$string_revenue)
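A quick spot-check of the substitution on a single made-up scraped string:
mgsub(patterns, replacements, '$25.36 billion')
# [1] "25.36"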
At this point the all.df$revenue column needs to be changed from character to numeric. This is easy enough with as.numeric, shown below. By default it will coerce the “N.A” strings to NA, so you will get a warning message, but that is expected behavior.
all.df$revenue<-as.numeric(all.df$revenue)
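Here is that coercion in isolation, on two hypothetical strings:
as.numeric(c('27.9', 'N.A.'))
# [1] 27.9   NA   (plus a "NAs introduced by coercion" warning)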
Recall that the values are a mix of millions and billions, so the order of magnitude of the values needs to be adjusted to reflect reality. For example, the scraped data had “$27.9 million” and, later, “$25.36 billion.” Obviously there is a huge difference between these values, so the units must be normalized.
To account for this change I used an ifelse statement, which follows this logic: ifelse(test, yes, no). If the test is TRUE the “yes” value is returned; otherwise the “no” value is returned.
The logical test in the code below uses grepl. I am searching for any string that contains “billion” in the column of raw scraped data. grepl will return TRUE if the pattern is found and FALSE if it is not. As a result, the test identifies any rows in the data set where the revenue is in billions; any FALSE rows are those where revenue was only in the millions. The “yes” operation says what to do when “billion” is found in the row: the all.df$revenue number is multiplied by 1,000,000,000. If “billion” was not found in the row, a FALSE from grepl, the number is instead multiplied by 1,000,000. In the end the ifelse statement reads like this:
all.df$revenue <- ifelse(grepl('billion', all.df$string_revenue),
                         all.df$revenue * 1000000000,
                         all.df$revenue * 1000000)
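As a sanity check, here is the same logic applied to two hypothetical values, one in millions and one in billions:
demo.strings <- c('$27.9 million', '$25.36 billion')
demo.values  <- c(27.9, 25.36)
ifelse(grepl('billion', demo.strings), demo.values * 1e9, demo.values * 1e6)
# [1] 2.790e+07 2.536e+10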
Lastly, to keep track of the periodicity in the data frame, I added another column called period. Since this is quarterly data, I paste together “Q”, a sequence from 1 to 4, and the year. The tricky part is that you have to create four 1996 values, four 1997 values and so on, so within the rep function the parameter each=4 ensures the years line up with the Q1 to Q4 sequence.
all.df$period <- paste("Q", 1:4, rep(1996:2015, each=4), sep="_")
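The first few labels confirm the quarters and years line up:
head(all.df$period)
# [1] "Q_1_1996" "Q_2_1996" "Q_3_1996" "Q_4_1996" "Q_1_1997" "Q_2_1997"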
Finally we’re all set to start examining and forecasting Amazon’s holiday revenue!
Exploring the Data
Using the base summary function, I like to examine the fundamental stats of the data. Not surprisingly, the minimum value is 0, from an early quarter where an NA was changed, and the max is the most recent fourth quarter.
summary(all.df$revenue)
To set up a base ggplot, pass in the data frame, all.df, along with an X axis, the ID column, and a Y axis representing the cleaned revenue values. Once that is stored in amzn.rev, add another layer using geom_line. Within geom_line I like a color that stands out, so I usually use “red” or “darkred”. The next layer applies a pre-constructed aesthetic, theme_gdocs, that mimics Google Docs. The theme_gdocs layer comes from ggthemes and is useful for saving time. The last layer adds the title “Amzn Quarterly Rev” using ggtitle.
amzn.rev <- ggplot(all.df, aes(x = ID, y = revenue))
amzn.rev + geom_line(color='darkred') + theme_gdocs() + ggtitle('Amzn Quarterly Rev')
Amazon’s quarterly revenue from 1996 to 2015.
Wow, it looks like each fourth quarter has the peak I experienced when I worked at Amazon! Visually examining this data, there are some interesting patterns. First, there is a clear and steep upward trend. Second, there is strong periodicity: a pattern that repeats every four quarters. If we could obtain the data, there are likely monthly, weekly and possibly daily periodic patterns that a forecaster could examine to produce a good future forecast.
To look at an individual quarter, I set up a small custom function called quarter.explore. The function accepts a data frame and a string called qtr. The default value of qtr is “Q_1”; to examine another quarter, change this parameter to “Q_2” and so on. The function will produce a basic revenue plot for only the quarter in question and will print a summary of that quarter’s revenues.
Within quarter.explore, the df$revenue column is subset using grepl while referencing the pattern “Q_1” or whatever qtr value was passed in. Using the revenue subsection, a plot is created with a title. Next, the revenue subsection is summarized and changed to a matrix so it will print vertically, which I find easier to read.
quarter.explore <- function(df, qtr='Q_1'){
  revenue <- subset(df$revenue, grepl(qtr, df$period))  # keep one quarter only
  plot(revenue)
  title(paste(qtr, "amzn rev"))           # label the plot with the quarter
  rev.sum <- as.matrix(summary(revenue))  # matrix form prints vertically
  return(rev.sum)
}
This code will create the simple plot below and print the summary for the second quarter revenues.
quarter.explore(all.df,"Q_2")
Amazon historical revenue for each second quarter.
The summary of Amazon’s historical second quarter revenue is here:
If interested, this code will print all the revenue values across the data frame for the second quarter:
subset(all.df,grepl('Q_2',all.df$period)==T)
Looking at this quarterly comparison view is startling; there is a ton of rapid growth. As a former employee, I knew we were growing like mad, but I was working too hard to really comprehend the actual numbers!
A screenshot of Amazon’s raw Q2 revenue.
Amazon’s 2015 second quarter revenue is more than 800 times the 1997 Q2 revenue!
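As a back-of-the-envelope check, assuming the “$27.9 million” figure mentioned earlier is the 1997 Q2 value, and taking roughly $23.2 billion for 2015 Q2 (an approximation on my part, not a value shown in this post):
23.2e9 / 27.9e6  # roughly 831, i.e. more than 800x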
Next Time
So far I have shown you how to collect the data, organize and clean it, and finally create a simple time series plot. In the next post we will actually change the data into a time series class object. From there I will show you how to perform time series decomposition and a Holt-Winters forecast to back up my claim that Amazon will make $41B this quarter! In the meantime, here is one of the plots we will create, which deconstructs Amazon’s quarterly revenue. The code in the next post will show you how to capture the trend, seasonality and the unaccounted-for “random” values.
A time series decomposition of Amazon’s quarterly revenue.
©ODSC2016

Edward Kwartler
Working at Liberty Mutual I shape the organization's strategy and vision concerning next generation vehicles. This includes today's advanced vehicle ADAS features and the self driving cars of the (near) future. I get to work with exciting startups, MIT labs, government officials, automotive leaders and various data scientists to understand how Liberty can thrive in this rapidly changing environment. Plus I get to internally incubate ideas and foster an entrepreneurial ethos! Specialties: Data Science, Text Mining, IT service management, Process improvement and project management, business analytics