Another batch of Think Stats notebooks

Getting ready to teach Data Science in the spring, I am going back through Think Stats and updating the Jupyter notebooks.  When I am done, each chapter will have a notebook that shows the examples from the book along with some small exercises, with more substantial exercises at the end.

If you are reading the book, you can get the notebooks by cloning this repository on GitHub, and running the notebooks on your computer.

Or you can read (but not run) the notebooks on GitHub:

Chapter 10 Notebook (Chapter 10 Solutions)
Chapter 11 Notebook (Chapter 11 Solutions)
Chapter 12 Notebook (Chapter 12 Solutions)

I’ll post the last two soon, but in the meantime you can see some of the more interesting exercises, and solutions, below.

Time series analysis

Load the data from “Price of Weed”.

In [2]:
transactions = pd.read_csv('mj-clean.csv', parse_dates=[5])
transactions.head()
Out[2]:
         city  state  price  amount  quality        date    ppg      state.name        lat         lon
0   Annandale     VA    100   7.075     high  2010-09-02  14.13        Virginia  38.830345  -77.213870
1      Auburn     AL     60  28.300     high  2010-09-02   2.12         Alabama  32.578185  -85.472820
2      Austin     TX     60  28.300   medium  2010-09-02   2.12           Texas  30.326374  -97.771258
3  Belleville     IL    400  28.300     high  2010-09-02  14.13        Illinois  38.532311  -89.983521
4       Boone     NC     55   3.540     high  2010-09-02  15.54  North Carolina  36.217052  -81.687983
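As a small illustration of what `parse_dates=[5]` does, here is a sketch on a tiny in-memory CSV. The rows are made up for the example (they are not from mj-clean.csv); the point is only that the list holds 0-based column positions, so 5 selects the date column.

```python
import io
import pandas as pd

# Made-up rows with the first seven columns in the same order as the table above.
csv_text = """city,state,price,amount,quality,date,ppg
Springfield,XX,100,10.0,high,2010-09-02,10.0
Shelbyville,XX,60,30.0,medium,2010-09-03,2.0
"""

# parse_dates=[5] refers to the column at position 5 (0-based), which is 'date'.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=[5])

print(sample.dtypes['date'])            # a datetime64 dtype rather than object
print(sample['date'].dt.year.tolist())  # the .dt accessor works once dates are parsed
```

Without `parse_dates`, the date column would be read as plain strings and the later date arithmetic would not work.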

The following function takes a DataFrame of transactions and computes daily averages.

In [3]:
def GroupByDay(transactions, func=np.mean):
    """Groups transactions by day and computes the daily mean ppg.

    transactions: DataFrame of transactions

    returns: DataFrame of daily prices
    """
    grouped = transactions[['date', 'ppg']].groupby('date')
    daily = grouped.aggregate(func)

    daily['date'] = daily.index
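    start = daily.date.iloc[0]
    # Use a fixed-length year (365.2425 days, the mean Gregorian year);
    # recent pandas versions may reject np.timedelta64(1, 'Y') in this division.
    one_year = pd.Timedelta(days=365.2425)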
    daily['years'] = (daily.date - start) / one_year

    return daily
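To see what the grouping step produces, here is a minimal sketch on three made-up transactions. The names `toy` and `daily_toy` are hypothetical, and the sketch uses a 365-day year as a simplifying assumption, so it is not a drop-in replacement for GroupByDay.

```python
import pandas as pd

# Hypothetical transactions: two on the first day, one a year later.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2010-09-02', '2010-09-02', '2011-09-02']),
    'ppg': [10.0, 14.0, 8.0],
})

# One row per day, holding the mean ppg for that day.
daily_toy = toy[['date', 'ppg']].groupby('date').aggregate('mean')

# Elapsed time since the first day, in years (365-day years for simplicity).
elapsed = daily_toy.index - daily_toy.index[0]
daily_toy['years'] = elapsed / pd.Timedelta(days=365)

print(daily_toy)
```

The two transactions on the first day collapse into a single row with their mean, and the years column gives a numeric time axis that is convenient for regression later.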

The following function returns a map from quality name to a DataFrame of daily averages.

In [4]:
def GroupByQualityAndDay(transactions):
    """Divides transactions by quality and computes mean daily price.

    transactions: DataFrame of transactions

    returns: map from quality to time series of ppg
    """
    groups = transactions.groupby('quality')
    dailies = {}
    for name, group in groups:
        dailies[name] = GroupByDay(group)        

    return dailies
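The loop relies on the fact that iterating over a pandas groupby yields (name, sub-DataFrame) pairs. A small sketch with made-up data shows the shape of the resulting map; `toy` and `toy_dailies` are hypothetical names, and the per-day mean is computed inline rather than by calling GroupByDay.

```python
import pandas as pd

# Hypothetical transactions with two quality levels.
toy = pd.DataFrame({
    'quality': ['high', 'medium', 'high', 'medium'],
    'date': pd.to_datetime(['2010-09-02', '2010-09-02',
                            '2010-09-03', '2010-09-02']),
    'ppg': [14.0, 2.0, 15.0, 4.0],
})

# Iterating over a groupby yields (group name, sub-DataFrame) pairs.
toy_dailies = {}
for name, group in toy.groupby('quality'):
    # Daily mean ppg for this quality only.
    toy_dailies[name] = group.groupby('date')['ppg'].mean()

print(sorted(toy_dailies))
print(toy_dailies['medium'])
```

Each value in the map covers only the days on which that quality was actually traded, so the series can have different lengths.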

dailies is the map from quality name to DataFrame.

In [5]:
dailies = GroupByQualityAndDay(transactions)

The following plots the daily average price for each quality.

In [6]:
import matplotlib.pyplot as plt

thinkplot.PrePlot(rows=3)
for i, (name, daily) in enumerate(dailies.items()):
    thinkplot.SubPlot(i+1)
    title = 'Price per gram ($)' if i == 0 else ''
    thinkplot.Config(ylim=[0, 20], title=title)
    thinkplot.Scatter(daily.ppg, s=10, label=name)
    if i == 2: 
        plt.xticks(rotation=30)
        thinkplot.Config()
    else:
        thinkplot.Config(xticks=[])
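
For readers who do not have thinkplot installed, a roughly equivalent layout can be sketched with plain matplotlib. This is an approximation under stated assumptions, not the code used in the notebook: `toy_dailies` is a hypothetical stand-in for the dailies map, filled with made-up prices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical daily series standing in for the dailies map.
dates = pd.to_datetime(['2010-09-02', '2011-09-02', '2012-09-02'])
toy_dailies = {
    'high': pd.Series([13.0, 12.5, 12.0], index=dates),
    'medium': pd.Series([9.0, 9.2, 9.1], index=dates),
    'low': pd.Series([5.0, 4.8, 4.5], index=dates),
}

# One row per quality, sharing the x-axis so only the bottom panel shows dates.
fig, axes = plt.subplots(nrows=3, sharex=True, figsize=(6, 8))
for ax, (name, series) in zip(axes, toy_dailies.items()):
    ax.scatter(series.index, series.values, s=10, label=name)
    ax.set_ylim(0, 20)
    ax.legend()

# Title on the top panel only, rotated date labels on the bottom panel.
axes[0].set_title('Price per gram ($)')
axes[-1].tick_params(axis='x', labelrotation=30)
fig.tight_layout()
```

Sharing the x-axis plays the role of the `xticks=[]` calls above: the upper panels drop their tick labels and the three series line up by date.
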
Allen Downey

I am a Professor of Computer Science at Olin College in Needham MA, and the author of Think Python, Think Bayes, Think Stats and several other books related to computer science and data science. Previously I taught at Wellesley College and Colby College, and in 2009 I was a Visiting Scientist at Google, Inc. I have a Ph.D. from U.C. Berkeley and B.S. and M.S. degrees from MIT. Here is my CV. I write a blog about Bayesian statistics and related topics called Probably Overthinking It. Several of my books are published by O’Reilly Media and all are available under free licenses from Green Tea Press.
