The Turf War Between Causality and Correlation In Data Science: Which One Is More Important?
Data scientists have tried to differentiate causality from correlation. Last month alone, I’ve seen 20+ posts referencing the catchphrase “correlation is not causality.” What they actually mean is that correlation is not as good as causality... Read more
Regression Discontinuity Design: The Crown Jewel of Causal Inference
Background In a series of posts (here, here, here, here and here), I’ve explained why and how we should run social experiments. However, it’s not possible to run social experiments all the time, so researchers have to identify causal effects using other observational and quasi-experimental methods... Read more
A Quick Look Into Bootstrapping
Executive Summary As a resampling method, bootstrapping allows us to generate statistical inferences about a population from a single sample. Learn to bootstrap in R. Bootstrapping lays the foundation for several machine learning methods (e.g., bagging, which I’ll explain in a follow-up post)... Read more
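The article itself teaches bootstrapping in R; as a language-neutral sketch of the resampling idea the excerpt describes, here is a minimal percentile bootstrap in Python. The data and function name are invented for illustration:

```python
import random
import statistics

def bootstrap_mean_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the sample mean."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, same size as the original sample
        resample = [rng.choice(sample) for _ in sample]
        means.append(statistics.mean(resample))
    means.sort()
    low = means[int(alpha / 2 * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

data = [4.1, 5.3, 4.8, 6.0, 5.5, 4.9, 5.2, 5.8, 4.6, 5.1]
low, high = bootstrap_mean_ci(data)
```

The key point is that the only ingredient is the single observed sample: uncertainty is estimated by repeatedly resampling it with replacement, not from a theoretical distribution.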
135 Nights of Sleep with Data, Anomaly Detection, and Time Series
In this article, I look at data from 135 nights of sleep and use anomaly detection and time series analysis to understand the results. Three things are certain in life: death, taxes, and sleeping. Here, we’ll talk about the last one. Every night*, we humans, after a... Read more
3 Regression Pitfalls in Business Applications
Regression is a fantastic tool for aiding business decisions. The traditional purpose of a regression model is to find the mean value of a dependent variable given a set of independent variables. In a business, this purpose should be expanded to include the reduction of uncertainty... Read more
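To make the excerpt’s point concrete (“find the mean value of a dependent variable given a set of independent variables”), here is a minimal one-variable least-squares fit in pure Python; the data are made up for illustration:

```python
def fit_simple_ols(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Invented example: spend (x) vs. revenue (y), following y = 2 + 3x exactly
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [5.0, 8.0, 11.0, 14.0, 17.0]
intercept, slope = fit_simple_ols(xs, ys)
```

The fitted line is the conditional mean of y given x; quantifying the uncertainty around that mean (prediction intervals rather than point estimates) is the expansion the excerpt argues for.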
Hierarchical Bayesian Models in R
Hierarchical approaches to statistical modeling are integral to a data scientist’s skill set because hierarchical data is incredibly common. In this article, we’ll go through the advantages of employing hierarchical Bayesian models and work through an exercise building one in R. If you’re unfamiliar with Bayesian... Read more
The Empirical Derivation of the Bayesian Formula
Deep learning has been made practical through modern computing power, but it is not the only technique benefiting from this large increase in power. Bayesian inference is an up-and-coming technique whose recent progress is powered by the same increase in computing power. We can explain the... Read more
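As a small taste of the formula the article derives, here is the classic diagnostic-test application of Bayes’ rule; the rates below are invented for illustration:

```python
# Bayes' rule: P(disease | positive test)
prior = 0.01             # P(disease), an assumed base rate
sensitivity = 0.95       # P(positive | disease)
false_positive = 0.05    # P(positive | no disease)

# Law of total probability: P(positive)
evidence = sensitivity * prior + false_positive * (1 - prior)

# Bayes' rule: posterior = likelihood * prior / evidence
posterior = sensitivity * prior / evidence
# Despite the 95%-sensitive test, the posterior is only about 16%,
# because the disease is rare (prior = 1%).
```

The counterintuitive result — a positive test from an accurate instrument yielding a low posterior — is exactly the kind of reasoning the Bayesian formula makes mechanical.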
Why Do Tree Ensembles Work?
Ensembles of decision trees (e.g., the random forest and AdaBoost algorithms) are powerful and well-known methods of classification and regression. We will survey work aimed at understanding the statistical properties of decision tree ensembles, with the goal of explaining why they work. An elementary probabilistic motivation... Read more
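The “elementary probabilistic motivation” the excerpt alludes to is often the majority-vote argument: if base classifiers were independent and each slightly better than chance, ensemble accuracy would rise with the number of voters. A sketch of that calculation (independence is an idealizing assumption here, not how real trees behave):

```python
import math

def majority_vote_accuracy(n, p):
    """P(majority of n independent classifiers is correct),
    each correct with probability p; n assumed odd."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# One weak learner vs. an idealized ensemble of 101 such learners
single = majority_vote_accuracy(1, 0.6)      # just the base rate, 0.6
ensemble = majority_vote_accuracy(101, 0.6)  # far higher under independence
```

Real trees trained on the same data are correlated, which is why random forests inject randomness (bagging, feature subsampling) to push the base learners toward the independent-voter ideal.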
The Importance of P-Values in Data Science
The field of data science makes use of concepts from a variety of disciplines, particularly computer science, mathematics, and applied statistics. One term that keeps popping up in data science circles (including many interviews for data scientist positions) is “p-value,” which comes from statistics. This... Read more
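For readers who want the mechanics behind the term, a p-value can be illustrated without any distribution theory via a permutation test: how often would a difference at least as extreme as the observed one arise if group labels were shuffled at random? The two samples below are invented:

```python
import random
import statistics

def permutation_p_value(a, b, n_perm=5000, seed=0):
    """Two-sided p-value for a difference in means, under the null
    hypothesis that the group labels are exchangeable."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of the pooled data
        diff = abs(statistics.mean(pooled[:len(a)])
                   - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm

# Clearly separated groups -> small p-value
p = permutation_p_value([1.0, 2.0, 3.0, 4.0, 5.0],
                        [11.0, 12.0, 13.0, 14.0, 15.0])
```

Note what the p-value is and is not: it is the probability of data this extreme *given the null*, not the probability that the null is true.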
Confidence Intervals for Data Scientists
The confidence interval is a basic statistical concept commonly employed by data scientists. Without a formal background in statistics, however, some data scientists scratch their heads over what’s really going on with this notion. In this article, we’ll review the... Read more
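As a minimal companion to the excerpt, here is the textbook normal-approximation interval for a sample mean (z ≈ 1.96 for roughly 95% coverage); the data are made up for the example:

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Approximate 95% CI for the mean via the normal approximation:
    mean +/- z * standard error."""
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return m - z * se, m + z * se

heights = [1.62, 1.75, 1.68, 1.80, 1.71, 1.66, 1.74, 1.69]
low, high = mean_confidence_interval(heights)
```

The subtlety the article addresses: the 95% refers to the long-run behavior of the *procedure* over repeated samples, not to a 95% probability that this particular interval contains the true mean.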