The Empirical Derivation of the Bayesian Formula
Editor’s note: James is a speaker for ODSC London this November! Be sure to check out his talk, “The How, Why, and When of Replacing Engineering Work with Compute Power” there. Deep learning has been made practical through modern computing power, but it is not the... Read more
Why Do Tree Ensembles Work?
Ensembles of decision trees (e.g., the random forest and AdaBoost algorithms) are powerful and well-known methods of classification and regression. We will survey work aimed at understanding the statistical properties of decision tree ensembles, with the goal of explaining why they work. An elementary probabilistic motivation... Read more
The Importance of P-Values in Data Science
The field of data science makes use of concepts from a variety of disciplines, particularly computer science, mathematics, and applied statistics. One term that keeps popping up in data science circles (including many interviews for data scientist positions) is “p-value,” which comes from statistics. This... Read more
Confidence Intervals for Data Scientists
The confidence interval is a basic statistical concept commonly employed by data scientists. Without a formal background in statistics, however, some data scientists are unsure what’s really going on with this notion. In this article, we’ll review the... Read more
How to Play Fantasy Sports Strategically (and Win)
Daily Fantasy Sports is a multibillion-dollar industry with millions of annual users. The Imperial College Business School’s Martin Haugh created a framework to best those users by modeling what they’ll do and constructing a team based on it. Haugh presented his research on how to play... Read more
Thomas Wiecki of Quantopian on ‘Minding the Gap’ Between Statistics and Machine Learning at ODSC Europe 2018
Key Takeaways: It’s important for data scientists to understand the so-called “gap” between statistics and machine learning, and how there actually is a lot of commonality between the two; it’s just a matter of how you look at things. PyMC3 is a very useful probabilistic programming... Read more
Exploring the Central Limit Theorem in R
The Central Limit Theorem (CLT) is arguably the most important theorem in statistics. It’s certainly a concept that every data scientist should fully understand. In this article, we’ll go over some basic theory of the CLT, explain why it’s important for data scientists, and present some... Read more
Mine Like Amazon with Market Basket Analysis
Pattern mining is an incredibly simple but powerful technique for discovering co-occurrences in large datasets. The most common approach to finding those patterns is Market Basket Analysis, which is frequently pointed out as the method Amazon leverages for their “users also purchased” feature. Of course, that’s... Read more
Building a Microservice for Twitter Real-Time Data Collection and Sentiment Analysis
First of all, I would like to point out that the skill of building MVPs and microservices is extremely useful for a data scientist! When you can build a prototype and test it in a working environment, it just feels so much better and allows you... Read more
Intro to Ontologies
Statistical methods are inarguably the hottest approach to evaluating datasets at scale right now. They’re not without their weaknesses though – they’re ultimately heuristic, and some methods like neural networks require tremendous amounts of data to create a well-fitted model. That’s where semantics come in. If... Read more