The 5 Skills You Need to Start Machine Learning
With any new skill, hobby, or career path, you likely have more questions than answers. How do I get started? What skills do I need to focus on first? What sources do I trust to learn all of this? Data science and machine learning are no... Read more
A Quick Look Into Bootstrapping
Executive Summary: As a resampling method, bootstrapping allows us to draw statistical inferences about the population from a single sample. Learn to bootstrap in R. Bootstrapping lays the foundation for several machine learning methods (e.g., Bagging, which I’ll explain in a follow-up post). [Related Article: Discovering... Read more
Hierarchical Bayesian Models in R
Hierarchical approaches to statistical modeling are integral to a data scientist’s skill set because hierarchical data is incredibly common. In this article, we’ll cover the advantages of employing hierarchical Bayesian models and work through an exercise building one in R. If you’re unfamiliar with Bayesian... Read more
Why Do Tree Ensembles Work?
Ensembles of decision trees (e.g., the random forest and AdaBoost algorithms) are powerful and well-known methods of classification and regression. We will survey work aimed at understanding the statistical properties of decision tree ensembles, with the goal of explaining why they work. An elementary probabilistic motivation... Read more
Confidence Intervals for Data Scientists
The confidence interval is a basic statistical concept commonly employed by data scientists. Without a formal background in statistics, however, some data scientists struggle to understand what the notion really means. In this article, we’ll review the... Read more
How to Play Fantasy Sports Strategically (and Win)
Daily Fantasy Sports is a multibillion-dollar industry with millions of annual users. The Imperial College Business School’s Martin Haugh created a framework to best those users by modeling what they’ll do and constructing a team based on it. Haugh presented his research on how to play... Read more
Thomas Wiecki of Quantopian on ‘Minding the Gap’ Between Statistics and Machine Learning at ODSC Europe 2018
Key Takeaways: It’s important for data scientists to understand the so-called “gap” between statistics and machine learning, and how there actually is a lot of commonality between the two; it’s just a matter of how you look at things. PyMC3 is a very useful probabilistic programming... Read more
Exploring the Central Limit Theorem in R
The Central Limit Theorem (CLT) is arguably the most important theorem in statistics. It’s certainly a concept that every data scientist should fully understand. In this article, we’ll go over some basic theory of the CLT, explain why it’s important for data scientists, and present some... Read more
Mine Like Amazon with Market Basket Analysis
Pattern mining is an incredibly simple but powerful technique for discovering co-occurrences in large datasets. The most common approach to finding those patterns is Market Basket Analysis, which is frequently pointed out as the method Amazon leverages for its “users also purchased” feature. Of course, that’s... Read more
Another batch of Think Stats notebooks
Getting ready to teach Data Science in the spring, I am going back through Think Stats and updating the Jupyter notebooks. When I am done, each chapter will have a notebook that shows the examples from the book along with some small exercises, with more substantial... Read more