New notebooks for Think Stats

Getting ready to teach Data Science in the spring, I am going back through Think Stats and updating the Jupyter notebooks.  When I am done, each chapter will have a notebook that shows the examples from the book along with some small exercises, with more substantial exercises at the end.

If you are reading the book, you can get the notebooks by cloning this repository on GitHub, and running the notebooks on your computer.

Or you can read (but not run) the notebooks on GitHub:

Chapter 1 Notebook (Chapter 1 Solutions)
Chapter 2 Notebook (Chapter 2 Solutions)
Chapter 3 Notebook (Chapter 3 Solutions)

I’ll post more soon, but in the meantime you can see some of the more interesting exercises, and solutions, below.

Exercise: Something like the class size paradox appears if you survey children and ask how many children are in their family. Families with many children are more likely to appear in your sample, and families with no children have no chance to be in the sample.
Use the NSFG respondent variable numkdhh to construct the actual distribution for the number of children under 18 in the respondents’ households.
Now compute the biased distribution we would see if we surveyed the children and asked them how many children under 18 (including themselves) are in their household.
Plot the actual and biased distributions, and compute their means.
In [36]:
resp = nsfg.ReadFemResp()
In [37]:
# Solution

pmf = thinkstats2.Pmf(resp.numkdhh, label='numkdhh')
In [38]:
# Solution

thinkplot.Pmf(pmf)
thinkplot.Config(xlabel='Number of children', ylabel='PMF')
In [39]:
# Solution

biased = BiasPmf(pmf, label='biased')
In [40]:
# Solution

thinkplot.PrePlot(2)
thinkplot.Pmfs([pmf, biased])
thinkplot.Config(xlabel='Number of children', ylabel='PMF')
In [41]:
# Solution

pmf.Mean()
Out[41]:
1.0242051550438309
In [42]:
# Solution

biased.Mean()
Out[42]:
2.4036791006642821
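The cells above call BiasPmf without showing it. The idea is to weight each family size x by x, since a family with x children is x times as likely to be reached through one of its children, and then renormalize. Here is a minimal sketch of that step using plain dictionaries rather than thinkstats2.Pmf; the names bias_pmf and pmf_mean and the toy distribution are assumptions for illustration, not the NSFG data.

```python
def bias_pmf(pmf):
    """Return the size-biased version of a PMF given as {value: probability}."""
    # Weight each value by itself: families of size x are sampled x times as often.
    weighted = {x: p * x for x, p in pmf.items()}
    total = sum(weighted.values())
    # Renormalize so the probabilities sum to 1 again.
    return {x: w / total for x, w in weighted.items()}


def pmf_mean(pmf):
    """Mean of a PMF given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())


# Toy distribution (made up for illustration, not NSFG data)
actual = {0: 0.5, 1: 0.2, 2: 0.2, 3: 0.1}
biased = bias_pmf(actual)

print(pmf_mean(actual), pmf_mean(biased))
```

Families with no children get weight zero, so they drop out of the biased distribution entirely, and the biased mean works out to E[X^2] / E[X], which is never smaller than the actual mean.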
Exercise: I started this book with the question, “Are first babies more likely to be late?” To address it, I computed the difference in means between groups of babies, but I ignored the possibility that there might be a difference between first babies and others for the same woman.
To address this version of the question, select respondents who have at least two live births and compute pairwise differences. Does this formulation of the question yield a different result?
Hint: use nsfg.MakePregMap:
In [43]:
live, firsts, others = first.MakeFrames()
In [44]:
preg_map = nsfg.MakePregMap(live)
In [45]:
# Solution

hist = thinkstats2.Hist()

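for caseid, indices in preg_map.items():
    # Keep only respondents with at least two live births
    if len(indices) >= 2:
        # Pregnancy lengths for the first two live births;
        # preg_map was built from live, so its indices are labels in live
        pair = live.loc[indices[0:2]].prglngth
        # Second birth minus first birth, in weeks
        diff = np.diff(pair)[0]
        hist[diff] += 1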
In [46]:
# Solution

thinkplot.Hist(hist)
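To make the sign convention of the paired comparison concrete, here is a small sketch on made-up data that does not depend on nsfg or thinkstats2. The dictionary lengths_by_case and the function paired_differences are hypothetical stand-ins for preg_map and the loop above; each difference is the second birth minus the first, which is the order np.diff uses.

```python
from collections import Counter

# Hypothetical toy data: caseid -> pregnancy lengths in weeks, in birth order
lengths_by_case = {
    1: [39, 40],
    2: [41, 39, 38],
    3: [40],          # only one live birth, so this respondent is skipped
    4: [39, 39],
}


def paired_differences(lengths_by_case):
    """Count (second - first) pregnancy-length differences per respondent."""
    diffs = Counter()
    for caseid, lengths in lengths_by_case.items():
        if len(lengths) >= 2:
            diffs[lengths[1] - lengths[0]] += 1
    return diffs


diffs = paired_differences(lengths_by_case)
n = sum(diffs.values())
mean_diff = sum(d * count for d, count in diffs.items()) / n

print(diffs, mean_diff)
```

With this convention, a negative mean difference means the first pregnancy was longer on average than the second for the same woman.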
Allen Downey

I am a Professor of Computer Science at Olin College in Needham MA, and the author of Think Python, Think Bayes, Think Stats and several other books related to computer science and data science. Previously I taught at Wellesley College and Colby College, and in 2009 I was a Visiting Scientist at Google, Inc. I have a Ph.D. from U.C. Berkeley and B.S. and M.S. degrees from MIT. Here is my CV. I write a blog about Bayesian statistics and related topics called Probably Overthinking It. Several of my books are published by O’Reilly Media and all are available under free licenses from Green Tea Press.