Another batch of Think Stats notebooks

Getting ready to teach Data Science in the spring, I am going back through Think Stats and updating the Jupyter notebooks.  When I am done, each chapter will have a notebook that shows the examples from the book along with some small exercises, with more substantial exercises at the end.

If you are reading the book, you can get the notebooks by cloning this repository on GitHub, and running the notebooks on your computer.

Or you can read (but not run) the notebooks on GitHub:

Chapter 10 Notebook (Chapter 10 Solutions)
Chapter 11 Notebook (Chapter 11 Solutions)
Chapter 12 Notebook (Chapter 12 Solutions)

I’ll post the last two soon, but in the meantime you can see some of the more interesting exercises, and solutions, below.

Time series analysis

Load the data from “Price of Weed”.

In [2]:
transactions = pd.read_csv('mj-clean.csv', parse_dates=[5])
transactions.head()
Out[2]:
         city  state  price  amount  quality        date    ppg      state.name        lat         lon
0   Annandale     VA    100   7.075     high  2010-09-02  14.13        Virginia  38.830345  -77.213870
1      Auburn     AL     60  28.300     high  2010-09-02   2.12         Alabama  32.578185  -85.472820
2      Austin     TX     60  28.300   medium  2010-09-02   2.12           Texas  30.326374  -97.771258
3  Belleville     IL    400  28.300     high  2010-09-02  14.13        Illinois  38.532311  -89.983521
4       Boone     NC     55   3.540     high  2010-09-02  15.54  North Carolina  36.217052  -81.687983

The following function takes a DataFrame of transactions and computes daily averages.

In [3]:
def GroupByDay(transactions, func=np.mean):
    """Groups transactions by day and computes the daily mean ppg.

    transactions: DataFrame of transactions

    returns: DataFrame of daily prices
    """
    grouped = transactions[['date', 'ppg']].groupby('date')
    daily = grouped.aggregate(func)

    daily['date'] = daily.index
    start = daily.date[0]
    one_year = np.timedelta64(1, 'Y')
    daily['years'] = (daily.date - start) / one_year

    return daily
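
To see what the grouping step produces, here is a minimal sketch on a tiny made-up table. The name group_by_day_sketch and the toy values are hypothetical and not from the book. One deliberate difference: it measures a year as 365.25 days with pd.Timedelta, an assumption made here because some recent pandas versions reject the ambiguous 'Y' unit when dividing timedeltas.

```python
import pandas as pd

# Synthetic stand-in for the transactions table (values are made up).
toy = pd.DataFrame({
    'date': pd.to_datetime(['2010-09-02', '2010-09-02',
                            '2010-09-03', '2011-09-02']),
    'ppg': [10.0, 14.0, 12.0, 9.0],
})

def group_by_day_sketch(transactions, func='mean'):
    """Hypothetical variant of GroupByDay: daily aggregate plus elapsed years."""
    daily = transactions[['date', 'ppg']].groupby('date').aggregate(func)
    daily['date'] = daily.index
    start = daily['date'].iloc[0]          # first day, selected by position
    one_year = pd.Timedelta(days=365.25)   # assumption: average year length
    daily['years'] = (daily['date'] - start) / one_year
    return daily

toy_daily = group_by_day_sketch(toy)
print(toy_daily)
```

The two transactions on 2010-09-02 collapse into one row with their mean ppg, and the years column counts elapsed time from the first day in the data.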

The following function returns a map from quality name to a DataFrame of daily averages.

In [4]:
def GroupByQualityAndDay(transactions):
    """Divides transactions by quality and computes mean daily price.

    transactions: DataFrame of transactions

    returns: map from quality to time series of ppg
    """
    groups = transactions.groupby('quality')
    dailies = {}
    for name, group in groups:
        dailies[name] = GroupByDay(group)        

    return dailies

dailies is the map from quality name to DataFrame.

In [5]:
dailies = GroupByQualityAndDay(transactions)
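
The shape of such a map can be sketched on made-up data. The example below is a simplified, hypothetical version: it stores a Series of daily mean ppg per quality rather than the full DataFrame that GroupByDay returns, and the toy values are invented.

```python
import pandas as pd

# Synthetic transactions (values are made up for illustration).
toy = pd.DataFrame({
    'quality': ['high', 'high', 'low', 'low', 'medium'],
    'date': pd.to_datetime(['2010-09-02', '2010-09-02', '2010-09-02',
                            '2010-09-03', '2010-09-02']),
    'ppg': [14.0, 16.0, 4.0, 6.0, 9.0],
})

# One Series of daily mean ppg per quality, keyed by quality name.
toy_dailies = {
    name: group.groupby('date')['ppg'].mean()
    for name, group in toy.groupby('quality')
}

for name, series in toy_dailies.items():
    print(name, len(series), series.mean())
```

Each key is a quality name, and each value is indexed by date, so the groups can be plotted or modeled separately.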

The following plots the daily average price for each quality.

In [6]:
import matplotlib.pyplot as plt

thinkplot.PrePlot(rows=3)
for i, (name, daily) in enumerate(dailies.items()):
    thinkplot.SubPlot(i+1)
    title = 'Price per gram ($)' if i == 0 else ''
    thinkplot.Config(ylim=[0, 20], title=title)
    thinkplot.Scatter(daily.ppg, s=10, label=name)
    if i == 2: 
        plt.xticks(rotation=30)
        thinkplot.Config()
    else:
        thinkplot.Config(xticks=[])
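
For readers without the thinkplot module, a roughly equivalent layout can be sketched with plain matplotlib. This is an approximation under stated assumptions, not the book's code: the data are made up, and the Agg backend line is only there so the sketch runs without a display (drop it in a notebook).

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic daily prices per quality (values are made up).
dates = pd.to_datetime(['2010-09-02', '2010-09-03', '2010-09-04'])
toy_dailies = {
    'high': pd.Series([14.0, 15.0, 13.5], index=dates),
    'medium': pd.Series([9.0, 8.5, 9.5], index=dates),
    'low': pd.Series([4.0, 5.0, 4.5], index=dates),
}

# One row per quality, sharing the x axis so only the bottom row shows dates.
fig, axes = plt.subplots(nrows=len(toy_dailies), sharex=True)
for ax, (name, ppg) in zip(axes, toy_dailies.items()):
    ax.scatter(ppg.index, ppg.values, s=10, label=name)
    ax.set_ylim(0, 20)
    ax.legend()

axes[0].set_title('Price per gram ($)')
plt.setp(axes[-1].get_xticklabels(), rotation=30)
```

Sharing the x axis plays the role of the xticks=[] setting above: the upper panels drop their tick labels and only the last panel shows rotated dates.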
Allen Downey
