Seven Python Kernels from Kaggle You Need to See Right Now
The ability to post and share kernels is probably my favorite thing about Kaggle. Learning from other users’ kernels has often provided inspiration for my own projects. I also appreciate the attention to detail and the descriptions some users provide in their code. This... Read more
Why I like the Convolution Theorem
The convolution theorem (or theorems: it has versions that some people would call distinct species and others would describe as mere subspecies) is another almost obviously almost true result, this time about asymptotic efficiency. It’s an asymptotic version of the Cramér–Rao bound. Suppose \(\hat\theta\) is an efficient estimator of... Read more
The inspiration for this post is a joint venture by both me and my husband, and its genesis lies more than 15 years in our past. One of the recurring conversations we have in our relationship (all long-term relationships have these, right?!) is about song lyrics and place names.... Read more
Third batch of notebooks for Think Stats
As I mentioned in the previous post and the one before that, I am getting ready to teach Data Science in the spring, so I am going back through Think Stats and updating the Jupyter notebooks. I am done with Chapters 1 through 9 now. If you are reading the book, you can get... Read more
Time Series Analysis with Generalized Additive Models
Whenever you spot a trend plotted against time, you are looking at a time series. The de facto choice for studying financial market performance and weather forecasts, time series analysis is one of the most pervasive analysis techniques because of its inextricable relation to time: we are always interested to... Read more
As a Data Scientist who works on Feed Personalization, I find it important to stay up to date with the current state of Machine Learning and its applications. Most of the time, using some of the better-known recommendation algorithms yields good initial results; however, sometimes a change in the... Read more
In this post we will describe how to evaluate a predictive model. Why bother creating complex predictive models if 5% of the customers will churn anyway? Because a predictive model will rank our clients based on the probability that they will abandon the company. It helps answer these two questions: 1.... Read more
Do Resampling Estimates Have Low Correlation to the Truth?
The Answer May Shock You. One criticism that is often leveled against using resampling methods (such as cross-validation) to measure model performance is that there is no correlation between the CV results and the true error rate. Let’s look at this with some simulated data. While this assertion is... Read more
This is a two-part series about using machine learning to hack my taste in music. In this first piece, I applied unsupervised learning techniques and tools to Pandora data to analyze songs that I like. The second part, which will be published soon, is about using supervised learning on Spotify data to... Read more
It wasn’t an overbooking problem. United Airlines was trying to move four flight crew members to the next airport. They forced passengers to get off the plane, with the consequences we saw in the video from last Sunday, but don’t take our word for it. Let’s talk data. An elaborate... Read more