Data Imputation: Beyond Mean, Median and Mode
Types of Missing Data 1. Unit Non-Response: Unit non-response refers to entire rows of missing data. An example of this might be people who choose not to fill out the census. Here, we don’t necessarily see NaNs in our data, but we know there are values missing because we know... Read more
From Idea to Insight: Using Bayesian Hierarchical Models to Predict Game Outcomes Part 2
What’s the best way to model the probability that one player beats another in a digital game a client of your employer designed? This is the second of a two-part series in which you’re a data scientist at a fictional mobile game development company that makes money by monetizing... Read more
From Idea to Insight: Using Bayesian Hierarchical Models to Predict Game Outcomes Part 1
Imagine you’re a data scientist at an online mobile multiplayer competition platform. Your bosses have a vested interest in paying people with your skill set to predict game outcomes for a variety of commercial applications they profit from, for example, setting odds and sharing better insights with game developers on... Read more
The Turf War Between Causality and Correlation In Data Science: Which One Is More Important?
Data scientists have tried to differentiate causality from correlation. In the last month alone, I’ve seen 20+ posts referencing the catchphrase “correlation is not causality.” What they actually want to say is that correlation is not as good as causality... Read more
Regression Discontinuity Design: The Crown Jewel of Causal Inference
Background: In a series of posts (here, here, here, here and here), I’ve explained why and how we should run social experiments. However, it’s not possible to run social experiments all the time, and researchers have to identify causal effects using other observational and quasi-experimental methods... Read more
A Quick Look Into Bootstrapping
Executive Summary: As a resampling method, bootstrapping allows us to generate statistical inferences about the population from a single sample. Learn to bootstrap in R. Bootstrapping lays the foundation for several machine learning methods (e.g., bagging; I’ll explain bagging in a follow-up post)... Read more
Discovering 135 Nights of Sleep with Data, Anomaly Detection, and Time Series
In this article, I look at data from 135 nights of sleep and use anomaly detection and time series data to understand the results. Three things are certain in life: death, taxes, and sleeping. Here, we’ll talk about the last. Every night*, we humans, after a long day of... Read more
3 Common Regression Pitfalls in Business Applications
Regression is a fantastic tool for aiding business decisions. The traditional purpose of a regression model is to find the mean value of a dependent variable given a set of independent variables. In a business, this purpose should be expanded to include the reduction of uncertainty in future events.... Read more
Hierarchical Bayesian Models in R
Hierarchical approaches to statistical modeling are integral to a data scientist’s skill set because hierarchical data is incredibly common. In this article, we’ll go through the advantages of employing hierarchical Bayesian models and go through an exercise building one in R. If you’re unfamiliar with Bayesian modeling, I recommend... Read more
The Empirical Derivation of the Bayesian Formula
Editor’s note: James is a speaker for ODSC London this November! Be sure to check out his talk, “The How, Why, and When of Replacing Engineering Work with Compute Power” there. Deep learning has been made practical through modern computing power, but it is not the only technique benefiting... Read more