Discovering 135 Nights of Sleep with Data, Anomaly Detection, and Time Series
In this article, I look at data from 135 nights of sleep and use anomaly detection and time series data to understand the results. Three things are certain in life: death, taxes, and sleeping. Here, we’ll talk about the last. Every night*, we humans, after a long day of... Read more
3 Common Regression Pitfalls in Business Applications
Regression is a fantastic tool for aiding business decisions. The traditional purpose of a regression model is to find the mean value of a dependent variable given a set of independent variables. In a business, this purpose should be expanded to include the reduction of uncertainty in future events.... Read more
Hierarchical Bayesian Models in R
Hierarchical approaches to statistical modeling are integral to a data scientist’s skill set because hierarchical data is incredibly common. In this article, we’ll cover the advantages of employing hierarchical Bayesian models and work through an exercise building one in R. If you’re unfamiliar with Bayesian modeling, I recommend... Read more
The Empirical Derivation of the Bayesian Formula
Editor’s note: James is a speaker for ODSC London this November! Be sure to check out his talk, “The How, Why, and When of Replacing Engineering Work with Compute Power” there. Deep learning has been made practical through modern computing power, but it is not the only technique benefiting... Read more
Why Do Tree Ensembles Work?
Ensembles of decision trees (e.g., the random forest and AdaBoost algorithms) are powerful and well-known methods of classification and regression. We will survey work aimed at understanding the statistical properties of decision tree ensembles, with the goal of explaining why they work. An elementary probabilistic motivation for ensemble methods... Read more
The Importance of P-Values in Data Science
The field of data science makes use of concepts from a variety of disciplines, particularly computer science, mathematics, and applied statistics. One term that keeps popping up in data science circles (including many interviews for data scientist positions) is “p-value,” which comes from statistics. This term is frequently... Read more
Confidence Intervals for Data Scientists
The confidence interval is a basic statistical concept commonly employed by data scientists. Without a formal background in statistics, however, some data scientists are left scratching their heads over what’s really going on with this notion. In this article, we’ll review the basics of confidence... Read more
How to Play Fantasy Sports Strategically (and Win)
Daily Fantasy Sports is a multibillion-dollar industry with millions of annual users. The Imperial College Business School’s Martin Haugh created a framework to best those users by modeling what they’ll do and constructing a team based on it. Haugh presented his research on how to play Fantasy sports strategically... Read more
Thomas Wiecki of Quantopian on ‘Minding the Gap’ Between Statistics and Machine Learning at ODSC Europe 2018
Key Takeaways: It’s important for data scientists to understand the so-called “gap” between statistics and machine learning, and how there actually is a lot of commonality between the two; it’s just a matter of how you look at things. PyMC3 is a very useful probabilistic programming framework for Python.... Read more
Exploring the Central Limit Theorem in R
The Central Limit Theorem (CLT) is arguably the most important theorem in statistics. It’s certainly a concept that every data scientist should fully understand. In this article, we’ll go over some basic theory of the CLT, explain why it’s important for data scientists, and present some R code that... Read more