## The Turf War Between Causality and Correlation In Data Science: Which One Is More Important?

Modeling, Statistics, causality, Correlation; posted by Leihua Ye, January 6, 2020

Data scientists have tried to differentiate causality from correlation. In the last month alone, I’ve seen 20+ posts referencing the catchphrase “correlation is not causality.” What they actually want to say is that correlation is not as good as causality. [Related Article: Discovering 135 Nights of Sleep with Data,... Read more

## Regression Discontinuity Design: The Crown Jewel of Causal Inference

Modeling, Statistics, causal inference; posted by Leihua Ye, December 17, 2019

Background: In a series of posts (here, here, here, here and here), I’ve explained why and how we should run social experiments. However, it’s not possible to run social experiments all the time, and researchers have to identify causal effects using other observational and quasi-experimental methods. [Related Article: Causal Inference: An... Read more

## A Quick Look Into Bootstrapping

Machine Learning, Modeling, R, Statistics, Tools & Languages, bootstrapping; posted by Leihua Ye, December 3, 2019

Executive Summary: As a resampling method, bootstrapping allows us to draw statistical inferences about the population from a single sample. Learn to bootstrap in R. Bootstrapping lays the foundation for several machine learning methods (e.g., bagging, which I’ll explain in a follow-up post). [Related Article: Discovering... Read more

## 135 Nights of Sleep with Data, Anomaly Detection, and Time Series

Modeling, Python, R, Statistics, Tools & Languages, anomaly detection, Time Series; posted by Juan De Dios Santos, November 4, 2019

In this article, I look at data from 135 nights of sleep and use anomaly detection and time series analysis to understand the results. Three things are certain in life: death, taxes, and sleeping. Here, we’ll talk about the last. Every night*, we humans, after a... Read more

## 3 Regression Pitfalls in Business Applications

Business + Management, Modeling, Statistics, regression; posted by Jacey Heuer, October 21, 2019

Regression is a fantastic tool for aiding business decisions. The traditional purpose of a regression model is to find the mean value of a dependent variable given a set of independent variables. In a business, this purpose should be expanded to include the reduction of uncertainty... Read more

## Hierarchical Bayesian Models in R

Hierarchical approaches to statistical modeling are integral to a data scientist’s skill set because hierarchical data is incredibly common. In this article, we’ll cover the advantages of employing hierarchical Bayesian models and work through an exercise building one in R. If you’re unfamiliar with Bayesian... Read more

## The Empirical Derivation of the Bayesian Formula

Guest contributor, Machine Learning, Modeling, Statistics, Bayesian; posted by Jannes Klaas, June 18, 2019

Deep learning has been made practical through modern computing power, but it is not the only technique benefiting from this large increase in power. Bayesian inference is an up-and-coming technique whose recent progress is also powered by the increase in computing power. We can explain the... Read more

## Why Do Tree Ensembles Work?

Guest contributor, Machine Learning, Modeling, Statistics; posted by Joe Ross, April 10, 2019

Ensembles of decision trees (e.g., the random forest and AdaBoost algorithms) are powerful and well-known methods of classification and regression. We will survey work aimed at understanding the statistical properties of decision tree ensembles, with the goal of explaining why they work. An elementary probabilistic motivation... Read more

## The Importance of P-Values in Data Science

Modeling, Statistics; posted by Daniel Gutierrez, ODSC, February 26, 2019

The field of data science makes use of concepts from a variety of disciplines, particularly computer science, mathematics, and applied statistics. One term that keeps popping up in data science circles (including many interviews for data scientist positions) is “p-value,” which comes from statistics. This... Read more

## Confidence Intervals for Data Scientists

Modeling, Statistics; posted by Daniel Gutierrez, ODSC, January 17, 2019

The confidence interval is a basic statistical concept commonly employed by data scientists. Without a formal background in statistics, however, some data scientists tend to scratch their heads over what’s really going on with this notion. In this article, we’ll review the... Read more