Incremental Development of PyMC Models
PyMC is a powerful tool for doing Bayesian statistics, but getting started can be intimidating. This article presents an example that I think is a good starting place, and demonstrates a method I use to develop and test models incrementally. Games like hockey and soccer are... Read more
Finding That Needle! Isolation Forests for Anomaly Detection
One of the best parts of data science is that algorithms developed for one application turn up in other applications they were not originally designed for! This is very true in the world of fraud and anomaly detection. Many algorithms have their foundation elsewhere but find... Read more
Data Science’s Role in Anomaly Detection
Anomalies. The Oxford dictionary defines them as things that deviate from what is normal or expected. No matter what field you are in, they seem to pop up without warning. In the realm of data, anomalies can lead to incorrect or out-of-date decisions to be... Read more
Introducing PyMC Labs: Saving the World with Bayesian Modeling
After I left Quantopian in 2020, something interesting happened: various companies contacted me inquiring about consulting to help them with their PyMC3 models. Usually, I don’t hear how people are using PyMC3 — they mostly show up on GitHub or Discourse when something isn’t working right. So, hearing about all these really... Read more
The Bayesians are Coming! The Bayesians are Coming, to Time Series
Editor’s note: Aric is a speaker for ODSC West 2020 this October. Check out his talk, “The Bayesians are Coming! The Bayesians are Coming, to Time Series,” there! Forecasting has applications across all industries. From needing to predict future values of sales for a product line,... Read more
Data Imputation: Beyond Mean, Median and Mode
Types of Missing Data. 1. Unit Non-Response: Unit non-response refers to entire rows of missing data. An example of this might be people who choose not to fill out the census. Here, we don’t necessarily see... Read more
From Idea to Insight: Using Bayesian Hierarchical Models to Predict Game Outcomes Part 2
What’s the best way to model the probability that one player beats another in a digital game a client of your employer designed? This is the second of a two-part series in which you’re a data scientist at a fictional mobile game development company that makes... Read more
From Idea to Insight: Using Bayesian Hierarchical Models to Predict Game Outcomes Part 1
Imagine you’re a data scientist at an online mobile multiplayer competition platform. Your bosses have a vested interest in paying people with our skillset to predict game outcomes for a variety of... Read more
The Turf War Between Causality and Correlation In Data Science: Which One Is More Important?
Data scientists have tried to differentiate causality from correlation. Last month alone, I’ve seen 20+ posts referencing the catchphrase “correlation is not causality.” What they actually want to say is correlation is not as good as causality... Read more
Regression Discontinuity Design: The Crown Jewel of Causal Inference
Background: In a series of posts (here, here, here, here and here), I’ve explained why and how we should run social experiments. However, it’s not possible to run social experiments all the time, and researchers have to identify causal effects by other observational and quasi-experimental methods... Read more